I’ve been scraping websites with BeautifulSoup for several years, but not always using the Requests library.
```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://weimergeeks.com/index.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
```
```python
import requests
from bs4 import BeautifulSoup

url = "https://weimergeeks.com/index.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
```
So they are really similar, but it turns out that the Requests library offers us two choices: instead of html.text, we could use html.content. So what's the difference, and does it matter?
As usual, it’s Stack Overflow to the rescue.
html.text will be the normal, usual choice. It gives us the content of the HTTP response as Unicode text, which will suit probably 99.9 percent of requests.
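A quick sketch shows this. Here I build a Response object by hand so the snippet runs without hitting the network; in a real script, requests.get(url) would hand you one with the body already filled in, and the sample bytes are made up for illustration:

```python
import requests

# Hand-built Response so this runs offline; normally requests.get(url)
# returns one with _content and encoding already set from the server.
r = requests.models.Response()
r._content = "<h1>Café</h1>".encode("utf-8")  # raw bytes, as sent over HTTP
r.encoding = "utf-8"                          # normally taken from the headers

print(type(r.text))  # .text decodes the bytes into a Python str
print(r.text)
```

Because .text decodes using the encoding Requests detects from the response headers, accented characters like the é above come through correctly.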
html.content gives us the content of the HTTP response as raw bytes. We would choose that for a non-HTML file, such as a PDF or an image.
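Here's a sketch of that case, again with a hand-built Response so it runs offline (picture it as the result of requests.get() on an image URL; the PNG bytes and filename are made up for illustration):

```python
import requests

# Hand-built Response standing in for requests.get() on an image URL.
r = requests.models.Response()
r._content = bytes([0x89, 0x50, 0x4E, 0x47])  # the first bytes of a PNG file

print(type(r.content))  # .content is raw bytes, untouched by any decoding

# Because .content is bytes, open the output file in binary mode ("wb"):
with open("download.png", "wb") as f:
    f.write(r.content)
```

Trying to write those bytes through .text instead would either garble the file or raise a decoding error, which is exactly why .content is the right choice for anything that isn't text.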