Reputation: 97
My workflow is basically:
-Open the website "https://aplicacoes.mds.gov.br/sagirmps/estrutura_fisica/preenchimento_municipio_cras_new1.php".
-Fill in the 2 form fields (with AC - Acre and Bujari, for example).
-Click "Dados Detalhados" (detailed data) in the last column of the generated table. (Clicking "Dados Detalhados" generates a second table with one month of data per row.)
-Access the data behind the second table by clicking "Visualizar Relatório" in the last column of each row. <---- THAT'S the data I'm trying to scrape. But it is a dynamic website and I can't get the data just by requesting url2 (when you click "Visualizar Relatório" the website returns to the initial URL, but with the tables I want to scrape). Here is the code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'http://aplicacoes.mds.gov.br/sagirmps/estrutura_fisica/preenchimento_municipio_cras_new1.php'
params ={
'uf_ibge': '12',
'nome_estado': 'AC - Acre',
'p_ibge': '1200138',
'nome_municipio': 'Bujari',
}
r = requests.post(url, params = params, verify = False)
soup = BeautifulSoup(r.text, "lxml")
tables = pd.read_html(r.text)
unidades = tables[1]
print(unidades)
url2 = 'http://aplicacoes.mds.gov.br/sagirmps/estrutura_fisica/rel_preenchidos_cras.php?&p_id_cras=12001301971'
params2 ={
'p_id_cras': '12001301971',
'mes_referencia': '2019-02-01',
}
r2 = requests.post(url2, json = params2, verify = False)
soup2 = BeautifulSoup(r2.text, 'lxml')
soup2
Note that url2 is the URL generated when you click "Dados Detalhados", and it carries 'p_id_cras' in its query string, the same key used in the second dictionary. params2 should be the dict used to request the data I'm talking about. I've tried passing it as params, data and json in the second POST request, but none of them works.
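For context on the three keyword arguments tried above: in requests, params= is appended to the URL as a query string, data= with a dict is sent as a form-encoded body, and json= is sent as a JSON body. The sketch below uses only the standard library (not requests itself) to show roughly what each encoding looks like for params2. The helper names and the endpoint URL are placeholders invented for illustration.

```python
import json
from urllib.parse import urlencode

params2 = {
    'p_id_cras': '12001301971',
    'mes_referencia': '2019-02-01',
}

def as_query_string(url, fields):
    """Roughly what params= does: the fields end up in the URL."""
    return url + '?' + urlencode(fields)

def as_form_body(fields):
    """Roughly what data= does with a dict: a form-encoded request body."""
    return urlencode(fields).encode('ascii')

def as_json_body(fields):
    """Roughly what json= does: a JSON document as the request body."""
    return json.dumps(fields).encode('utf-8')

# Placeholder URL for illustration only, not the real endpoint.
endpoint = 'https://example.invalid/report.php'

print(as_query_string(endpoint, params2))
print(as_form_body(params2))
print(as_json_body(params2))
```

Which of the three a server accepts depends on how it reads its input. A script that expects form fields will typically not see anything in a JSON body, so the right argument and the right URL both have to match what the browser actually sends.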
Upvotes: 1
Views: 65
Reputation: 142641
url2 should be requested with GET, without any additional parameters.
That gives you a page with a table whose links have href="javascript:" but also onclick='enviadados(12001301971,"2019-02-01")', so you have the parameters for the next request.
The last request uses POST with the parameters 12001301971 and 2019-02-01, and the URL https://aplicacoes.mds.gov.br/sagirmps/estrutura_fisica/visualiza_preenchimento_cras.php
My code. I hope it works correctly.
import requests
from bs4 import BeautifulSoup
import pandas as pd

base = 'http://aplicacoes.mds.gov.br/sagirmps/estrutura_fisica/'
url = base + 'preenchimento_municipio_cras_new1.php'
#print('url:', url)

params = {
    'uf_ibge': '12',
    'nome_estado': 'AC - Acre',
    'p_ibge': '1200138',
    'nome_municipio': 'Bujari',
}

r = requests.post(url, params=params, verify=False)
soup1 = BeautifulSoup(r.text, "lxml")
tables = pd.read_html(r.text)
#unidades = tables[1]
#print(unidades)

all_td1 = soup1.find('table', class_="panel-body").find_all('td')
#print('len(all_td1):', len(all_td1))

for td1 in all_td1:
    all_a1 = td1.find_all('a')[:1]
    #print('len(all_a1):', len(all_a1))
    for a1 in all_a1:
        url = base + a1['href']
        print('url:', url)
        r = requests.get(url, verify=False)
        soup2 = BeautifulSoup(r.text, "lxml")
        #print(soup.text)
        all_td2 = soup2.find('table', class_="panel-body").find_all('td')
        #print('len(all_td2):', len(all_td2))
        for td2 in all_td2:
            all_a2 = td2.find_all('a')
            #print('len(all_a2):', len(all_a2))
            for a2 in all_a2:
                print('onclick:', a2['onclick'])
                params = {
                    'p_id_cras': a2['onclick'][11:22], #'12001301971',
                    'mes_referencia': a2['onclick'][24:-2], #'2019-02-01',
                }
                print(params)
                url = 'https://aplicacoes.mds.gov.br/sagirmps/estrutura_fisica/visualiza_preenchimento_cras.php'
                r = requests.post(url, params=params, verify=False)
                soup = BeautifulSoup(r.text, "lxml")
                all_table = soup.find_all('table')
                print(all_table)
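The fixed slices [11:22] and [24:-2] above assume an 11-digit id and that exact spacing in the onclick text. As an alternative sketch (not part of the original code), a regular expression can pull both arguments out of enviadados(...). The function name parse_enviadados and the pattern are assumptions about the attribute format, based only on the single example shown.

```python
import re

# Assumed format: enviadados(<digits>,"<YYYY-MM-DD>"), as in the example above.
ENVIADADOS_RE = re.compile(
    r'''enviadados\(\s*(\d+)\s*,\s*["'](\d{4}-\d{2}-\d{2})["']\s*\)'''
)

def parse_enviadados(onclick):
    """Return the POST parameters encoded in an onclick string, or None."""
    match = ENVIADADOS_RE.search(onclick)
    if match is None:
        return None
    return {'p_id_cras': match.group(1), 'mes_referencia': match.group(2)}

print(parse_enviadados('enviadados(12001301971,"2019-02-01")'))
```

In the inner loop, params = parse_enviadados(a2['onclick']) could then stand in for the two slices, and links whose onclick does not match (the function returns None) can be skipped.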
Upvotes: 1