I’ve been scraping websites with BeautifulSoup for several years, but not always using the Requests library.
Old way:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://weimergeeks.com/index.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
New way:
import requests
from bs4 import BeautifulSoup

url = "https://weimergeeks.com/index.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, "html.parser")
The two are really similar, but it turns out the Requests library offers us two choices: instead of html.text, we could use html.content. So what's the difference, and does it matter?
As usual, it's Stack Overflow to the rescue. html.text will be the normal, usual choice. It gives us the content of the HTTP response as unicode text, which will suit probably 99.9 percent of all requests. html.content gives us the content of the HTTP response as raw bytes. We would choose that for a non-HTML file, such as a PDF or an image.
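To make the distinction concrete, here is a minimal sketch. It illustrates the bytes-versus-unicode relationship with plain Python values rather than a live request (the .pdf URL in the comment is hypothetical):

```python
# response.content holds raw bytes; response.text is those same
# bytes decoded to a unicode str using the encoding Requests
# detects. Illustrated here without a network call:
raw = "café menu".encode("utf-8")   # what response.content would hold
decoded = raw.decode("utf-8")       # what response.text would give us

print(type(raw))      # bytes
print(type(decoded))  # str

# For a non-HTML file such as a PDF, write response.content in
# binary mode (hypothetical URL, shown commented out):
# r = requests.get("https://example.com/report.pdf")
# with open("report.pdf", "wb") as f:
#     f.write(r.content)
```

Note that decoding matters: the accented "é" is two bytes in UTF-8, so writing raw bytes to disk preserves a file exactly, while .text is the right choice once you want to parse the response as HTML.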