Scraping web data with select/option using request_html and BeautifulSoup in Python3

Question

I am new to data scraping, but I don't ask this question carelessly without digging around for a suitable answer.

I want to download the table from this page: https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje.

As you can see from the following screenshot, there are a couple of select/option on the top of the table. The corresponding html code (on the right) shows that the second half (2) and year 2021 are selected. By re-selecting and resubmit the form, content of the table changes, but the url remains unchanged. However, the changes are reflected in the html code. See the second following screenshot, wherein options are modified into 1 and 2018.

Based on these inspections, I've put together a python script (using bs4 and requests_html) to get the initial page, modify select/option, then post them back to the url. See below for the code. However, it fails its task. The webpage doesn't response to the modification. Could anyone kindly shed some lights on it?

Thanks in advance,

Liang

from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin

url = "https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje#"

# initialize an HTTP session
session = HTMLSession()

# Get request
res = session.get(url)

# for javascript driven website
# res.html.render()
soup = BeautifulSoup(res.html.html, "html.parser")

# Get all select tags
selects = soup.find_all("select")

# Modify select tags
# Select the first half of a year
selects[0].contents[1].attrs['selected']=''
del selects[0].contents[3].attrs['selected']

# Put into a dictionary
data = {}
data[selects[0]['name']] = selects[0]
data[selects[1]['name']] = selects[1]

# Post it back to the website
res = session.post(url, data=data)

# Remake the soup after the modification
soup = BeautifulSoup(res.content, "html.parser")

# the below code is only for replacing relative URLs to absolute ones
for link in soup.find_all("link"):
    try:
        link.attrs["href"] = urljoin(url, link.attrs["href"])
    except:
        pass
for script in soup.find_all("script"):
    try:
        script.attrs["src"] = urljoin(url, script.attrs["src"])
    except:
        pass
for img in soup.find_all("img"):
    try:
        img.attrs["src"] = urljoin(url, img.attrs["src"])
    except:
        pass
for a in soup.find_all("a"):
    try:
        a.attrs["href"] = urljoin(url, a.attrs["href"])
    except:
        pass

# write the page content to a file
open("page.html", "w").write(str(soup))

Scraping web data with select/option using request_html and BeautifulSoup in Python3

Answers (1)

Related Questions