Link Search Menu Expand Document

Check Your Understanding 1

First, import the urlopen function from the urlib.request module:

from urllib.request import urlopen

Then open the URL and use the .read() method of the HTTPResponse object returned by urlopen() to read the page’s HTML:

url = "http://olympus.realpython.org/profiles/dionysus"
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")

.read() returns a byte string, so you use .decode() to decode the bytes using the UTF-8 encoding.

Now that you have the HTML source of the web page as a string assigned to the html_text variable, you can extract Dionysus’s name and favorite color from his profile. The structure of the HTML for Dionysus’s profile is the same as Aphrodite’s profile that you saw earlier.

You can get the name by finding the string “Name:” in the text and extracting everything that comes after the first occurence of the string and before the next HTML tag. That is, you need to extract everything after the colon (:) and before the first angle bracket (<). You can use the same technique to extract the favorite color.

The following for loop extracts this text for both the name and favorite color:

for string in ["Name: ", "Favorite Color:"]:
    string_start_idx = html_text.find(string)
    text_start_idx = string_start_idx + len(string)

    next_html_tag_offset = html_text[text_start_idx:].find("<")
    text_end_idx = text_start_idx + next_html_tag_offset

    raw_text = html_text[text_start_idx : text_end_idx]
    clean_text = raw_text.strip(" \r\n\t")
    print(clean_text)
Dionysus
Wine

It looks like there’s a lot going on in this forloop, but it’s just a little bit of arithmetic to calculate the right indices for extracting the desired text. Let’s break it down:

  1. You use html_text.find() to find the starting index of the string, either "Name:" or "Favorite Color:", and then assign the index to string_start_idx.
  2. Since the text to extract starts just after the colon in "Name:" or "Favorite Color:", you get the index of the the character immediately after the colon by adding the length of the string to start_string_idx and assign the result to text_start_idx.
  3. You calculate the ending index of the text to extract by determining the index of the first angle bracket (<) relative to text_start_idx and assign this value to next_html_tag_offset. Then you add that value to text_start_idx and assign the result to text_end_idx.
  4. You extract the text by slicing html_text from text_start_idx to text_end_idx and assign this string to raw_text.
  5. You remove any whitespace from the beginning and end of raw_text using .strip() and assign the result to clean_text.

At the end of the loop, you use print() to display the extracted text. The final output looks like this:

Dionysus
Wine

This solution is one of many that solves this problem, so if you got the same output with a different solution, then you did great!

Check Your Understanding 2

First, import the urlopen function from the urlib.request module and the BeautifulSoup class from the bs4 package:

from urllib.request import urlopen
from bs4 import BeautifulSoup

Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:

base_url = "http://olympus.realpython.org"

You can build a full URL by concatenating base_url with a relative URL.

Now open the /profiles page with urlopen() and use .read() to get the HTML source:

html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")

With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:

soup = BeautifulSoup(html_text, "html.parser")

soup.find_all("a") returns a list of all links in the HTML source. You can loop over this list to print out all the links on the webpage:

for link in soup.find_all("a"):
    link_url = base_url + link["href"]
    print(link_url)
http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus

The relative URL for each link can be accessed through the "href" subscript. Concatenate this value with base_url to create the full link_url.

Check Your Understanding 3

First, import the mechanicalsoup package and create a Broswer object:

import mechanicalsoup

browser = mechanicalsoup.Browser()

Point the browser to the login page by passing the URL to browser.get() and grab the HTML with the .soup attribute:

login_url = "http://olympus.realpython.org/login"
login_page = browser.get(login_url)
login_html = login_page.soup

login_html is a BeautifulSoup instance. Since the page has only a single form on it, you can access the form via login_html.form. Using .select(), select the username and password inputs and fill them with the username “zeus” and the password “ThunderDude”:

form = login_html.form
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"

Now that the form is filled out, you can submit it with browser.submit():

profiles_page = browser.submit(form, login_page.url)

If you filled the form with the correct username and password, then profiles_page should actually point to the /profiles page. You can confirm this by printing the title of the page assigned to profiles_page:

print(profiles_page.soup.title)
<title>All Profiles</title>

You should see the following text displayed:

<title>All Profiles</title>

If instead you see the text Log In or something else, then the form submission failed.