Reputation: 61
I am new to data scraping, and I am not asking this question carelessly: I dug around for a suitable answer first.
I want to download the table from this page: https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje.
As you can see from the first screenshot below, there are a couple of select/option elements at the top of the table. The corresponding HTML code (on the right) shows that the second half (2) and the year 2021 are selected. By re-selecting and resubmitting the form, the content of the table changes, but the URL stays the same. However, the changes are reflected in the HTML code. See the second screenshot, in which the options have been changed to 1 and 2018.
Based on these inspections, I've put together a Python script (using bs4 and requests_html) to get the initial page, modify the select/option tags, and post them back to the URL. See the code below. However, it fails its task: the webpage doesn't respond to the modification. Could anyone kindly shed some light on it?
Thanks in advance,
Liang
from bs4 import BeautifulSoup
from requests_html import HTMLSession
from urllib.parse import urljoin
url = "https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje#"
# initialize an HTTP session
session = HTMLSession()
# Get request
res = session.get(url)
# for javascript driven website
# res.html.render()
soup = BeautifulSoup(res.html.html, "html.parser")
# Get all select tags
selects = soup.find_all("select")
# Modify select tags
# Select the first half of a year
selects[0].contents[1].attrs['selected']=''
del selects[0].contents[3].attrs['selected']
# Put into a dictionary
data = {}
data[selects[0]['name']] = selects[0]
data[selects[1]['name']] = selects[1]
# Post it back to the website
res = session.post(url, data=data)
# Remake the soup after the modification
soup = BeautifulSoup(res.content, "html.parser")
# the below code is only for replacing relative URLs to absolute ones
for link in soup.find_all("link"):
    try:
        link.attrs["href"] = urljoin(url, link.attrs["href"])
    except KeyError:
        pass
for script in soup.find_all("script"):
    try:
        script.attrs["src"] = urljoin(url, script.attrs["src"])
    except KeyError:
        pass
for img in soup.find_all("img"):
    try:
        img.attrs["src"] = urljoin(url, img.attrs["src"])
    except KeyError:
        pass
for a in soup.find_all("a"):
    try:
        a.attrs["href"] = urljoin(url, a.attrs["href"])
    except KeyError:
        pass
# write the page content to a file
with open("page.html", "w") as f:
    f.write(str(soup))
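For reference, I also looked at what actually gets posted: requests serializes each dictionary value with str(), so passing a whole select tag sends its full markup rather than the selected value. A stdlib-only sketch of the form body the server presumably expects (field names taken from the HTML above):

```python
from urllib.parse import urlencode

# Form body the server presumably expects: each field name mapped to the
# *value* of the chosen option, not to the <select> tag itself.
payload = {"semestre": "1", "ano": "2018", "buscar": "Buscar"}
print(urlencode(payload))  # semestre=1&ano=2018&buscar=Buscar
```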
Upvotes: 0
Views: 2249
Reputation: 28565
The selection can be made through a POST request, passing semestre and ano as parameters. For example:
import pandas as pd
import requests
semestre = 1
ano = 2018
url = 'https://www.portodemanaus.com.br/?pagina=nivel-do-rio-negro-hoje'
payload = {
    'semestre': str(semestre),
    'ano': str(ano),
    'buscar': 'Buscar'}
response = requests.post(url, params=payload)
df = pd.read_html(response.text)[7]
Output:
print(df)
0 1 ... 11 12
0 Dias Julho ... Dezembro Dezembro
1 Dias Cota (m) ... Cota (m) Encheu/ Vazou (cm)
2 1 2994 ... 000 000
3 2 2991 ... 000 000
4 3 2989 ... 000 000
5 4 2988 ... 000 000
6 5 2987 ... 000 000
7 6 2985 ... 000 000
8 7 2983 ... 000 000
9 8 2980 ... 000 000
10 9 2977 ... 000 000
11 10 2975 ... 000 000
12 11 2972 ... 000 000
13 12 2969 ... 000 000
14 13 2967 ... 000 000
15 14 2965 ... 000 000
16 15 2962 ... 000 000
17 16 2959 ... 000 000
18 17 2955 ... 000 000
19 18 2951 ... 000 000
20 19 2946 ... 000 000
21 20 2942 ... 000 000
22 21 2939 ... 000 000
23 22 2935 ... 000 000
24 23 2931 ... 000 000
25 24 2927 ... 000 000
26 25 2923 ... 000 000
27 26 2918 ... 000 000
28 27 2912 ... 000 000
29 28 2908 ... 000 000
30 29 2902 ... 000 000
31 30 2896 ... 000 000
32 31 2892 ... 000 000
33 Estatísticas Encheu ... Estável Estável
34 Estatísticas Vazou ... Estável Estável
35 Estatísticas Mínima ... Mínima 000
36 Estatísticas Média ... Média 000
37 Estatísticas Máxima ... Máxima 000
[38 rows x 13 columns]
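read_html returns the grid verbatim, including the two stacked header rows visible above. If you want proper column labels, those rows can be promoted to a column MultiIndex; a sketch on a toy frame mirroring the layout (the helper name is mine):

```python
import pandas as pd

def promote_headers(df: pd.DataFrame) -> pd.DataFrame:
    """Turn the first two rows (month / measurement) into a column MultiIndex."""
    header = pd.MultiIndex.from_arrays([df.iloc[0], df.iloc[1]])
    out = df.iloc[2:].copy()
    out.columns = header
    return out.reset_index(drop=True)

# Toy frame mirroring the scraped layout above
raw = pd.DataFrame([
    ["Dias", "Julho", "Julho"],
    ["Dias", "Cota (m)", "Encheu/ Vazou (cm)"],
    ["1", "2994", "000"],
    ["2", "2991", "000"],
])
tidy = promote_headers(raw)
print(tidy)
```

The same POST can then be repeated in a loop over semestre in (1, 2) and a range of years to collect a longer series.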
Upvotes: 3