Check Your Understanding 1
First, import the urlopen function from the urlib.request module:
from urllib.request import urlopen
Then open the URL and use the .read()
method of the HTTPResponse
object returned by urlopen()
to read the page’s HTML:
url = "http://olympus.realpython.org/profiles/dionysus"
html_page = urlopen(url)
html_text = html_page.read().decode("utf-8")
.read()
returns a byte string, so you use .decode()
to decode the bytes using the UTF-8
encoding.
Now that you have the HTML source of the web page as a string assigned to the html_text variable, you can extract Dionysus’s name and favorite color from his profile. The structure of the HTML for Dionysus’s profile is the same as Aphrodite’s profile that you saw earlier.
You can get the name by finding the string “Name:” in the text and extracting everything that comes after the first occurence of the string and before the next HTML tag. That is, you need to extract everything after the colon (:
) and before the first angle bracket (<
). You can use the same technique to extract the favorite color.
The following for loop extracts this text for both the name and favorite color:
for string in ["Name: ", "Favorite Color:"]:
string_start_idx = html_text.find(string)
text_start_idx = string_start_idx + len(string)
next_html_tag_offset = html_text[text_start_idx:].find("<")
text_end_idx = text_start_idx + next_html_tag_offset
raw_text = html_text[text_start_idx : text_end_idx]
clean_text = raw_text.strip(" \r\n\t")
print(clean_text)
Dionysus
Wine
It looks like there’s a lot going on in this forloop, but it’s just a little bit of arithmetic to calculate the right indices for extracting the desired text. Let’s break it down:
- You use
html_text.find()
to find the starting index of the string, either"Name:"
or"Favorite Color:"
, and then assign the index to string_start_idx. - Since the text to extract starts just after the colon in
"Name:"
or"Favorite Color:"
, you get the index of the the character immediately after the colon by adding the length of the string tostart_string_idx
and assign the result totext_start_idx
. - You calculate the ending index of the text to extract by determining the index of the first angle bracket (
<
) relative to text_start_idx and assign this value tonext_html_tag_offset
. Then you add that value totext_start_idx
and assign the result totext_end_idx
. - You extract the text by slicing html_text from
text_start_idx
totext_end_idx
and assign this string toraw_text
. - You remove any whitespace from the beginning and end of
raw_text
using.strip()
and assign the result to clean_text.
At the end of the loop, you use print() to display the extracted text. The final output looks like this:
Dionysus
Wine
This solution is one of many that solves this problem, so if you got the same output with a different solution, then you did great!
Check Your Understanding 2
First, import the urlopen function from the urlib.request module and the BeautifulSoup class from the bs4 package:
from urllib.request import urlopen
from bs4 import BeautifulSoup
Each link URL on the /profiles page is a relative URL, so create a base_url variable with the base URL of the website:
base_url = "http://olympus.realpython.org"
You can build a full URL by concatenating base_url with a relative URL.
Now open the /profiles
page with urlopen()
and use .read()
to get the HTML source:
html_page = urlopen(base_url + "/profiles")
html_text = html_page.read().decode("utf-8")
With the HTML source downloaded and decoded, you can create a new BeautifulSoup object to parse the HTML:
soup = BeautifulSoup(html_text, "html.parser")
soup.find_all("a")
returns a list of all links in the HTML source. You can loop over this list to print out all the links on the webpage:
for link in soup.find_all("a"):
link_url = base_url + link["href"]
print(link_url)
http://olympus.realpython.org/profiles/aphrodite
http://olympus.realpython.org/profiles/poseidon
http://olympus.realpython.org/profiles/dionysus
The relative URL for each link can be accessed through the "href"
subscript. Concatenate this value with base_url
to create the full link_url
.
Check Your Understanding 3
First, import the mechanicalsoup package and create a Broswer object:
import mechanicalsoup
browser = mechanicalsoup.Browser()
Point the browser to the login page by passing the URL to browser.get()
and grab the HTML with the .soup
attribute:
login_url = "http://olympus.realpython.org/login"
login_page = browser.get(login_url)
login_html = login_page.soup
login_html
is a BeautifulSoup instance. Since the page has only a single form on it, you can access the form via login_html.form
. Using .select()
, select the username and password inputs and fill them with the username “zeus” and the password “ThunderDude”:
form = login_html.form
form.select("input")[0]["value"] = "zeus"
form.select("input")[1]["value"] = "ThunderDude"
Now that the form is filled out, you can submit it with browser.submit()
:
profiles_page = browser.submit(form, login_page.url)
If you filled the form with the correct username and password, then profiles_page should actually point to the /profiles
page. You can confirm this by printing the title of the page assigned to profiles_page
:
print(profiles_page.soup.title)
<title>All Profiles</title>
You should see the following text displayed:
<title>All Profiles</title>
If instead you see the text Log In or something else, then the form submission failed.