Reinaldo Chaves

Reputation: 995

How to get the hidden input's value with beautifulsoup?

In Python 3, I want to extract information from a page using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

link = "https://portal.stf.jus.br/processos/listarPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA"

try:
    res = requests.get(link)
except (requests.exceptions.HTTPError, requests.exceptions.RequestException, requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
    print(str(e))
except Exception as e:
    print("Exceção")

html = res.content.decode('utf-8') 

soup =  BeautifulSoup(html, "lxml")

pag = soup.find('div', {'id': 'total'})

print(pag)

In this case the information is in an HTML snippet like this:

<div id="total" style="display: inline-block"><input type="hidden" name="totalProc" id="totalProc" value="35">35</div>

What I want to access is the value attribute, in this case 35, so that I can capture the number "35".

That's why I used "pag = soup.find('div', {'id': 'total'})": to gradually isolate just the number 35.

But the content returned was just: <div id="total" style="display: inline-block"><img src="ajax-loader.gif"/></div>

Please, does anyone know how to capture only the value content?
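For reference, if the input element were actually present in the HTML that requests downloads, I would expect to read the attribute directly, something like this sketch (using the snippet above as sample data):

from bs4 import BeautifulSoup

# Sample data: the snippet I see in the browser's inspector
html = '<div id="total" style="display: inline-block"><input type="hidden" name="totalProc" id="totalProc" value="35">35</div>'

soup = BeautifulSoup(html, "lxml")
print(soup.find('input', {'id': 'totalProc'})['value'])  # 35

But as shown above, that input never appears in the HTML that requests receives, only the ajax-loader image.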

Upvotes: 3

Views: 2568

Answers (3)

QHarr

Reputation: 84465

It is dynamically pulled in via another XHR call, which you can find in the network tab:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://portal.stf.jus.br/processos/totalProcessosPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA&total=0')
soup = bs(r.content, 'lxml')
print(soup.select_one('#totalProc')['value'])

With regex:

import requests, re

r = requests.get('https://portal.stf.jus.br/processos/totalProcessosPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA&total=0')
# The value attribute may or may not be quoted in the response, so allow an optional quote
print(re.search(r'value="?(\d+)', r.text).group(1))

Upvotes: 3

Michele Bastione

Reputation: 363

As I was explaining in the comments, browser automation can be a very quick solution to this issue. The first thing you should do is install Google Chrome on your computer if you don't have it already. To be fair, it could work with any browser, but then I wouldn't be sure how to set up the code properly, as I have never done it before. Secondly, you must download a tool called "ChromeDriver". You can find it here. Once downloaded, extract the file and put it in the same directory as your Python script, which should look like the following:

from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import time

# Run Chrome headless, so no browser window pops up
ch = Options()
ch.add_argument("--disable-extensions")
ch.add_argument("--disable-gpu")
ch.add_argument("--headless")

browser = webdriver.Chrome(options=ch)
browser.get("https://portal.stf.jus.br/processos/listarPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA")
time.sleep(1)  # give the page a moment to load the dynamic content

pag = browser.find_element_by_id('totalProc')
print(pag.get_attribute('value'))
browser.quit()

Before executing it, don't forget to run pip install selenium in your terminal to install the selenium module itself. The script takes about 10 to 20 seconds to run, but it should work perfectly fine. Let me know if you have any trouble with it, but you definitely shouldn't.
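If the fixed one-second sleep ever turns out to be too short, you can swap it for an explicit wait; here is a minimal sketch of the same idea using Selenium's WebDriverWait (the 10-second timeout is just an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

ch = Options()
ch.add_argument("--headless")

browser = webdriver.Chrome(options=ch)
browser.get("https://portal.stf.jus.br/processos/listarPartes.asp?termo=AECIO%20NEVES%20DA%20CUNHA")

# Wait (up to 10 seconds) until the hidden input is actually in the DOM
pag = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, 'totalProc'))
)
print(pag.get_attribute('value'))
browser.quit()

This way the script continues as soon as the element appears instead of always sleeping for a fixed amount of time.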

Upvotes: 1

Anshul Rai

Reputation: 782

I'm not sure if this is a standard solution, but I personally like using regexes to isolate values from my BeautifulSoup results, since they can capture any kind of pattern. For example, in your case, if you decide to use regex, your code could look like this:

import regex

soup = str(BeautifulSoup(html, "lxml"))

pag = regex.findall(r'(?<=value=")\d+', soup)

print(pag[0])

You can verify that the regex returns the content in value here.
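If you'd rather avoid installing an extra package, the same fixed-width lookbehind also works with the standard library's re module; a quick sketch using the snippet from the question as sample input:

import re
from bs4 import BeautifulSoup

html = '<div id="total" style="display: inline-block"><input type="hidden" name="totalProc" id="totalProc" value="35">35</div>'
soup = str(BeautifulSoup(html, "lxml"))

# (?<=value=") is fixed-width, so re handles it the same way the regex module does
print(re.findall(r'(?<=value=")\d+', soup)[0])  # 35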

Upvotes: 1
