Reputation: 1940
I have two separate questions:
Question 1
I'm trying to scrape some tables from this website. See the attached image below.
So far, this is the code I have written:
from bs4 import BeautifulSoup
import requests
url = 'https://transparencia.registrocivil.org.br/registros'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'}
source = requests.get(url, headers=headers).text
soup = BeautifulSoup(source, 'html.parser')
table = soup.find('table')
print(table.prettify())
This code isn't working: table comes back as None, so it seems that BeautifulSoup can't find the table. What am I doing wrong?
Once that is sorted out, here is the second part of my question:
Question 2
My main idea is to scrape the data by iterating over the selectors shown in the image (year, month, region, and state) to collect the city-level data.
Some of those tables are large and are split across several pages, as you can see at the bottom of some tables on the website. How can I loop through all of those pages to gather the data together for each year, month, region, and state?
Upvotes: 0
Views: 829
Reputation: 10819
You can get the data you want without BeautifulSoup or Selenium.
If you open Google Chrome's Developer Console and log your network traffic, filtering the log to show only XHR resources, you will see that your browser makes requests to a web API, whose response is JSON containing all the data you could ever want.
Looking at the requests more closely, the API only accepts them if the request headers contain a valid User-Agent field and an XSRF token, which is just a cookie.
So, you have to:
1. Make an initial request to the main page, and pull the XSRF-TOKEN cookie out of the Set-Cookie field in the response headers.
2. Pass that token (URL-decoded) in the X-XSRF-TOKEN header of every API request you make afterwards.
Code:
import re
import sys
import requests

user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"

def get_cookie():
    # Request the main page and extract the XSRF token from the Set-Cookie header
    url = "https://transparencia.registrocivil.org.br/registros"
    headers = {
        "User-Agent": user_agent
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    # group(1) captures just the token value, without the "XSRF-TOKEN=" prefix
    token = re.match("XSRF-TOKEN=([^;]+)", response.headers["Set-Cookie"]).group(1)
    return requests.utils.unquote(token)

def get_states(cookie):
    # The cities endpoint lists every city along with its state ("uf") abbreviation
    url = "https://transparencia.registrocivil.org.br/api/cities"
    headers = {
        "User-Agent": user_agent,
        "X-XSRF-TOKEN": cookie
    }
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    return set(city["uf"] for city in response.json()["cities"])

def get_next_state_results(cookie):
    # Query the record filter endpoint once per state and yield each city record
    url = "https://transparencia.registrocivil.org.br/api/record/filter-all"
    headers = {
        "User-Agent": user_agent,
        "X-XSRF-TOKEN": cookie
    }
    for state in get_states(cookie):
        params = {
            "start_date": "2020-01-01",
            "end_date": "2020-12-31",
            "state": state
        }
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        for item in response.json()["data"]:
            yield item

def main():
    cookie = get_cookie()
    for result in get_next_state_results(cookie):
        print(f"{result['name']}: {result['total']}")
    return 0

if __name__ == "__main__":
    sys.exit(main())
You can modify the start_date and end_date query-string parameters in the params dict in the get_next_state_results generator to change the month and year.
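For example, if you wanted to pull the data one month at a time, you could compute each month's date range with the standard calendar module. Here is a minimal sketch of that idea (month_range is a hypothetical helper name, not something the site or the code above defines):

import calendar

def month_range(year, month):
    # Hypothetical helper: first and last day of the given month,
    # formatted as YYYY-MM-DD like the start_date/end_date values above
    last_day = calendar.monthrange(year, month)[1]
    return f"{year:04d}-{month:02d}-01", f"{year:04d}-{month:02d}-{last_day:02d}"

# Example: every month of 2020
for month in range(1, 13):
    start_date, end_date = month_range(2020, month)
    print(start_date, end_date)  # substitute these into the params dict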
The output is very long, so here are just the first few lines:
Abaré: 137
Acajutiba: 51
Aiquara: 31
Alagoinhas: 1153
Alcobaça: 174
Almadina: 23
Amargosa: 184
Amelia Rodrigues: 171
América Dourada: 120
Anagé: 78
Andaraí: 122
Andorinha: 53
Angical: 65
Anguera: 45
Antas: 106
Antônio Gonçalves: 70
Araças: 82
Aracatu: 95
Araci: 293
Aramari: 39
Aratuípe: 37
Aurelino Leal: 88
Baianópolis: 101
Baixa Grande: 126
Barra: 306
Barra da Estiva: 352
Barra do Choça: 235
Barra do Mendes: 81
Barra do Rocha: 18
Barreiras: 1902
Barro Alto: 86
Barro Preto: 45
Belmonte: 109
Belo Campo: 105
Boa Nova: 83
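If you want to collect the results instead of just printing them, one option is to accumulate them into a dict keyed by city name. This is just a sketch reusing the functions above (collect_totals is a name I made up):

def collect_totals(cookie):
    # Gather every city's total into a plain dict: {city_name: total}
    totals = {}
    for result in get_next_state_results(cookie):
        totals[result["name"]] = result["total"]
    return totals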
Upvotes: 1
Reputation: 541
I believe the data is loaded dynamically, so I would suggest using Selenium to scrape it. BeautifulSoup only parses the HTML you give it; it cannot execute the JavaScript that renders the table, which is why the requests + BeautifulSoup approach comes back empty.
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://transparencia.registrocivil.org.br/registros")
browser.implicitly_wait(10)  # poll for up to 10 seconds while the JavaScript renders the table
table = browser.find_element_by_css_selector("table")
for elem in table.find_elements_by_css_selector('tr'):
    print(elem.text)  # one table row per line
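As a side note, the find_element_by_* helpers used above were deprecated and later removed in Selenium 4, so on a current install the same lookups would be written with the By locator instead:

from selenium.webdriver.common.by import By

table = browser.find_element(By.CSS_SELECTOR, "table")
for elem in table.find_elements(By.CSS_SELECTOR, "tr"):
    print(elem.text)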
Upvotes: 1