Reputation: 995
In Python 3, I need to scrape a site that has search options in a menu bar: http://www.cnj.jus.br/bnmp/#/pesquisar
I just need to select the item "Estado" and, within it, the option "Rio de Janeiro" (a state of Brazil, with several cities), then click "Pesquisar".
The site then displays the items I need to store in a dataframe (spread over multiple pages, with a table on each) - 53,022 items such as:
Numero: "0002274-09.2012.8.19.0002.0001"
Nome: "Bruno Da Silva"
Situacao: "Aguardando Cumprimento"
Data: "23/01/2012"
Orgao: "TJRJ"
...
And so on in the following lines and pages
Using Inspect Element, in the Network tab I searched the XHR requests for the JSON endpoint behind the search, but I only found a link that returns the cities (municipios) of the State of Rio de Janeiro:
import requests
import pandas as pd
url = 'http://www.cnj.jus.br/bnmp/rest/pesquisarMunicipios/RJ'
response = requests.get(url)
print(response.json())
{'sucesso': True, 'mensagem': None, 'municipios': ['ANGRA DOS REIS', 'APERIBE', 'ARARUAMA', 'ARMACAO DOS BUZIOS', 'ARRAIAL DO CABO', 'BARRA DO PIRAI', 'BARRA MANSA', 'BELFORD ROXO', 'BOM JARDIM', 'BOM JESUS DO ITABAPOANA', 'CABO FRIO', 'CACHOEIRAS DE MACACU', 'CAMBUCI', 'CAMPOS DOS GOYTACAZES', 'CANTAGALO', 'CARAPEBUS', 'CARDOSO MOREIRA', 'CARMO', 'CASIMIRO DE ABREU', 'CONCEICAO DE MACABU', 'CORDEIRO', 'DUAS BARRAS', 'DUQUE DE CAXIAS', 'ENGENHEIRO PAULO DE FRONTIN', 'GUAPIMIRIM', 'IGUABA GRANDE', 'ITABORAI', 'ITAGUAI', 'ITALVA', 'ITAOCARA', 'ITAPERUNA', 'ITATIAIA', 'JAPERI', 'LAJE DO MURIAE', 'MACAE', 'MAGE', 'MANGARATIBA', 'MARICA', 'MENDES', 'MESQUITA', 'MIGUEL PEREIRA', 'MIRACEMA', 'NATIVIDADE', 'NILOPOLIS', 'NITEROI', 'NOVA FRIBURGO', 'NOVA IGUACU', 'PARACAMBI', 'PARAIBA DO SUL', 'PARATI', 'PATY DO ALFERES', 'PETROPOLIS', 'PINHEIRAL', 'PIRAI', 'PORCIUNCULA', 'PORTO REAL', 'QUEIMADOS', 'QUISSAMA', 'RESENDE', 'RIO BONITO', 'RIO CLARO', 'RIO DAS FLORES', 'RIO DAS OSTRAS', 'RIO DE JANEIRO', 'SANTA MARIA MADALENA', 'SANTO ANTONIO DE PADUA', 'SAO FIDELIS', 'SAO FRANCISCO DE ITABAPOANA', 'SAO GONCALO', 'SAO JOAO DA BARRA', 'SAO JOAO DE MERITI', 'SAO JOSE DO VALE DO RIO PRETO', 'SAO PEDRO DA ALDEIA', 'SAO SEBASTIAO DO ALTO', 'SAPUCAIA', 'SAQUAREMA', 'SEROPEDICA', 'SILVA JARDIM', 'SUMIDOURO', 'TERESOPOLIS', 'TRAJANO DE MORAES', 'TRAJANO DE MORAIS', 'TRES RIOS', 'VALENCA', 'VARRE-SAI', 'VASSOURAS', 'VOLTA REDONDA']}
Please, is there any way to find the created JSON of the items I want to scrape?
Or is there a better scraping strategy?
Upvotes: 1
Views: 134
Reputation: 91
I was able to see the right XHR request with Chrome's developer tools Network tab. I had the "Preserve log" option checked, which may be why I could see it when you couldn't.
I found it by starting at http://www.cnj.jus.br/bnmp/#/pesquisar, then selecting an estado, clicking Pesquisar, and then checking the network logs.
It looks like you need to make a POST request to http://www.cnj.jus.br/bnmp/rest/pesquisar. You'll also need to edit the payload to include the state and the page you need.
So it should look like this:
import requests

url = 'http://www.cnj.jus.br/bnmp/rest/pesquisar'
payload = {
    "criterio": {
        "orgaoJulgador": {
            "uf": "AC",  # state code - change to "RJ" for Rio de Janeiro
            "municipio": "",
            "descricao": ""
        },
        "orgaoJTR": {},
        "parte": {
            "documentos": [
                {"identificacao": ""}
            ]
        }
    },
    "paginador": {"paginaAtual": 2},  # page number to fetch
    "fonetica": "true",
    "ordenacao": {"porNome": False, "porData": False}
}

r = requests.post(url, json=payload)
print(r.status_code)
print(r.json())
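To collect all 53,022 items into a dataframe, you could loop over "paginaAtual" and append each page's rows. Below is a minimal sketch of that idea; note that the key holding the result rows ("mandados" here) and the field names are assumptions - inspect the real JSON from your browser first and substitute the actual names:

```python
import pandas as pd

def build_payload(uf, page):
    """Build the search payload for a given state code and page number."""
    return {
        "criterio": {
            "orgaoJulgador": {"uf": uf, "municipio": "", "descricao": ""},
            "orgaoJTR": {},
            "parte": {"documentos": [{"identificacao": ""}]},
        },
        "paginador": {"paginaAtual": page},
        "fonetica": "true",
        "ordenacao": {"porNome": False, "porData": False},
    }

# A fabricated sample page, standing in for r.json(); replace the
# "mandados" key with whatever key the real response actually uses.
sample_page = {
    "sucesso": True,
    "mandados": [  # hypothetical key - check the real response
        {"numero": "0002274-09.2012.8.19.0002.0001",
         "nome": "Bruno Da Silva",
         "situacao": "Aguardando Cumprimento",
         "data": "23/01/2012",
         "orgao": "TJRJ"},
    ],
}

# Flatten one page of rows into a DataFrame.
df = pd.DataFrame(sample_page["mandados"])
print(df.shape)
```

In the real loop you would call requests.post(url, json=build_payload("RJ", page)) for page = 1, 2, ... until a page comes back empty, concatenating the per-page DataFrames and ideally sleeping briefly between requests to be polite to the server.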
Upvotes: 2