Reputation: 995
In Python 3 I want to extract information from a page using requests and BeautifulSoup
import requests
from bs4 import BeautifulSoup
link = "https://portal.stf.jus.br/processos/listarPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA"
try:
    res = requests.get(link)
except (requests.exceptions.HTTPError, requests.exceptions.RequestException, requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
    print(str(e))
except Exception as e:
    print("Exception")
html = res.content.decode('utf-8')
soup = BeautifulSoup(html, "lxml")
pag = soup.find('div', {'id': 'total'})
print(pag)
In this case the information is in an HTML snippet like this:
<div id="total" style="display: inline-block"><input type="hidden" name="totalProc" id="totalProc" value="35">35</div>
What I want to access is the value attribute, in this case 35, i.e. capture the number "35".
That's why I used "pag = soup.find('div', {'id': 'total'})": to gradually isolate just the number 35.
But the content returned was just: <div id="total" style="display: inline-block"><img src="ajax-loader.gif"/></div>
Does anyone know how to capture only the content of value?
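For reference, parsing that snippet on its own does give me the number (a minimal example built only on the snippet above), so the problem seems to be that the page requests downloads never contains the filled-in input:
from bs4 import BeautifulSoup

snippet = '<div id="total" style="display: inline-block"><input type="hidden" name="totalProc" id="totalProc" value="35">35</div>'
soup = BeautifulSoup(snippet, "lxml")
print(soup.find('input', id='totalProc')['value'])  # prints 35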
Upvotes: 3
Views: 2568
Reputation: 84465
The value is pulled in dynamically by another XHR call, which you can find in the network tab:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://portal.stf.jus.br/processos/totalProcessosPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA&total=0')
soup = bs(r.content, 'lxml')
print(soup.select_one('#totalProc')['value'])
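That endpoint returns the same hidden <input id="totalProc"> element with its value attribute filled in, which is why select_one('#totalProc')['value'] can read the number directly.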
With regex
import requests, re
r = requests.get('https://portal.stf.jus.br/processos/totalProcessosPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA&total=0')
# the input may be rendered as value=35 or value="35", so make the quote optional
print(re.search(r'value="?(\d+)', r.text).group(1))
Upvotes: 3
Reputation: 363
As I was explaining in the comments, browser automation can be a very quick solution to this issue. The first thing you should do is install Google Chrome on your computer if you haven't got it already. To be fair it could work with any browser, but then I wouldn't be sure how to set up the code properly, as I have never done it before. Secondly, you must download a tool called "chromedriver". You can find it here. Once downloaded, extract the file and put it in the same directory as your Python script, which should look like the following:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

ch = Options()
ch.add_argument("--disable-extensions")
ch.add_argument("--disable-gpu")
ch.add_argument("--headless")

browser = webdriver.Chrome(options=ch)
browser.get("https://portal.stf.jus.br/processos/listarPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA")
time.sleep(1)  # give the page's XHR a moment to fill in the hidden input

pag = browser.find_element(By.ID, 'totalProc')  # find_element_by_id was removed in Selenium 4
print(pag.get_attribute('value'))
browser.quit()
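If chromedriver isn't being picked up from the script's directory automatically, you can point Selenium at the binary explicitly. A minimal sketch, assuming Selenium 4's Service API; the "./chromedriver" path is an assumption (the executable sitting next to the script):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

ch = Options()
ch.add_argument("--headless")
service = Service(executable_path="./chromedriver")  # assumed location: same folder as the script
browser = webdriver.Chrome(service=service, options=ch)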
Before executing it, don't forget to do pip install selenium in your terminal in order to install the actual selenium module.
The script takes about 10-20 seconds to run, but it should work perfectly fine.
Let me know if you have any trouble with it, but you definitely shouldn't.
Upvotes: 1
Reputation: 782
I'm not sure if this is a standard solution, but I personally like using regexes to isolate values from my BeautifulSoup results, since they can help capture any kind of pattern. For example, in your case, if you decide to use regex, your code could look like this:
import regex  # third-party module, installed with pip install regex

soup = str(BeautifulSoup(html, "lxml"))
pag = regex.findall(r'(?<=value=")\d+', soup)  # lookbehind: digits immediately after value="
print(pag[0])
You can verify that the regex returns the content in value here.
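As a side note, the standard-library re module supports the same fixed-width lookbehind, so the extra dependency isn't strictly required; an equivalent sketch, reusing the html string from your code:
import re
from bs4 import BeautifulSoup

soup = str(BeautifulSoup(html, "lxml"))  # html is the page content from your requests call
pag = re.findall(r'(?<=value=")\d+', soup)
print(pag[0])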
Upvotes: 1