Scraping details

I’ve been scraping websites with BeautifulSoup for several years, but not always using the Requests library.

Old way:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://weimergeeks.com/index.html"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

New way:

import requests
from bs4 import BeautifulSoup
url = "https://weimergeeks.com/index.html"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')

So they are really similar. One detail, though: the object that requests.get() returns offers two ways to read the page body, html.text and html.content. What's the difference, and does it matter?

As usual, it’s Stack Overflow to the rescue. html.text is the normal, usual choice for web pages. It gives us the body of the HTTP response as a Unicode string, decoded with the encoding that Requests works out from the response headers (or guesses, if the headers don't say), and that suits the vast majority of scraping jobs. html.content gives us the body of the HTTP response as raw bytes, with no decoding at all. We would choose that for a non-HTML file, such as a PDF or an image.
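
The difference comes down to bytes versus decoded text. Here is a minimal sketch of that idea using only the standard library, with no network call. In it, raw stands in for what html.content would hold, and the decoded string stands in for html.text. The sample markup and the helper name decode_body are made up for illustration, and Requests' own decoding logic is more involved than this.

```python
# Stand-in for html.content: the raw bytes of a response body.
# The markup is invented sample data, not a real page.
raw = "<p>café</p>".encode("utf-8")


def decode_body(body: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical helper: roughly what reading html.text amounts to."""
    # Bytes that cannot be decoded are replaced instead of raising an error.
    return body.decode(encoding, errors="replace")


text = decode_body(raw)                    # decoded with the right encoding
garbled = decode_body(raw, "iso-8859-1")   # decoded with the wrong encoding

print(type(raw).__name__, len(raw))    # bytes object; é takes two bytes in UTF-8
print(type(text).__name__, len(text))  # str object; é is a single character
print(text)
print(garbled)
```

If a scraped page ever comes out garbled like that last line, the usual remedy is to set html.encoding to the correct value before reading html.text. Another option is to pass html.content to BeautifulSoup, which then does its own encoding detection on the raw bytes.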
