Reputation: 33
I tried to parse a website, but I cannot get all of the information on the page. To be more precise, I need everything between <fgis-root> and </fgis-root>, but there is nothing there. How can I fix it?
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
url = 'https://pub.fsa.gov.ru/ral/view/8/applicant'
response = http.request('GET', url)
soup = BeautifulSoup(response.data, 'html.parser')  # specify a parser to avoid a warning
print(soup)
Upvotes: 3
Views: 2995
Reputation: 349
The problem you have encountered is a common one in web scraping. The web page at https://pub.fsa.gov.ru/ral/view/8/applicant loads the JavaScript file at https://pub.fsa.gov.ru/main.73d6a501bd7bda31d5ec.js, and that file is responsible for the dynamic content loading.
The root of the problem is that urllib3, requests, or any other HTTP client in Python does not render the JavaScript inside that web page. You therefore get only the initial response the server sends, which in many cases does not contain the information you need.
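You can see this for yourself with a minimal check (a sketch reusing the setup from the question; it assumes the server still returns an empty fgis-root shell, as described above):

from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
response = http.request('GET', 'https://pub.fsa.gov.ru/ral/view/8/applicant')
soup = BeautifulSoup(response.data, 'html.parser')

# The tag exists in the raw HTML, but the scripts have not run yet,
# so it has no content.
root = soup.find('fgis-root')
print(root)  # e.g. <fgis-root></fgis-root> -- present but empty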
A solution would be to use selenium. It allows you to drive a browser such as Chrome or Firefox programmatically, and these browsers actually render the page.
You were not specific about the information you are trying to scrape off this website, so my recommendation is to use an explicit wait until the element you wish to find is present in the DOM. You can find more information about waits in selenium here.
You should adapt this code to scrape the data you wish to scrape.
# Imports
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

# Constants
URL = 'https://pub.fsa.gov.ru/ral/view/8/applicant'
ELEMENT_XPATH = '/html/body/fgis-root/div/fgis-ral/fgis-card-view/div/div/fgis-view-applicant/fgis-card-block/div/div[2]'

def main():
    options = Options()
    options.headless = True  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    driver.get(URL)
    try:
        # Wait up to 10 seconds for the element to appear in the DOM
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, ELEMENT_XPATH))
        )
        print(element.text)
    except TimeoutException:
        print("Could not find the desired element")
    finally:
        driver.quit()

if __name__ == '__main__':
    main()
Upvotes: 1
Reputation: 17943
The information is not "hidden" so much as dynamically generated with JavaScript. You can confirm this by comparing "view source" with the DOM in the element inspector of the browser dev tools.
So JavaScript must be executed on a DOM to get the desired information. This can be accomplished with a headless browser, which executes JavaScript like a real browser and can be controlled programmatically to retrieve the desired data.
There are several different headless browsers, and drivers written for even more languages. I prefer to use headless Chrome with the Nick.js JavaScript driver. You could use the example script at the bottom of their homepage with a few modifications.
If you must use Python, here is a good tutorial to get started: Driving Headless Chrome with Python.
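In that vein, here is a minimal sketch of driving headless Chrome from Python with selenium (assuming chromedriver is installed; the fixed sleep is a crude stand-in for the explicit waits shown in the other answer):

import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True  # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://pub.fsa.gov.ru/ral/view/8/applicant')
    time.sleep(5)  # crude: give the scripts time to populate <fgis-root>
    print(driver.page_source)  # the DOM after JavaScript has run
finally:
    driver.quit()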
Upvotes: 0
Reputation: 1290
Since the content you are looking for is generated by JavaScript, you need to emulate a browser. You can use selenium to do that:
from selenium import webdriver

with webdriver.Firefox() as driver:  # e.g. using the Firefox webdriver
    driver.get('your_url_here')
    i = driver.find_elements_by_tag_name("fgis-root")
Also check out here all the available methods that selenium provides to locate elements in a page.
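For instance, a sketch of a few of those locator methods (Selenium 3 API, same as the snippet above; the selectors are illustrative):

from selenium import webdriver

with webdriver.Firefox() as driver:
    driver.get('your_url_here')
    # Locate by tag name, CSS selector, or XPath
    root = driver.find_element_by_tag_name('fgis-root')
    blocks = driver.find_elements_by_css_selector('fgis-card-block')
    divs = driver.find_elements_by_xpath('//fgis-root//div')
    print(root.tag_name, len(blocks), len(divs))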
Upvotes: 2
Reputation: 84475
You can mimic the GET request the page itself makes for the data. The request details come from the web traffic observed in the browser dev tools (F12, Network tab) when loading the page. The authorisation token and session id may be time-limited. You can use a Session to handle the cookies part, by making a prior request to the original URL first within the same session.
import requests
import urllib3; urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

headers = {
    'Pragma': 'no-cache',
    'DNT': '1',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    'lkId': '',
    'Accept': 'application/json, text/plain, */*',
    'Cache-Control': 'no-cache',
    'Authorization': 'Bearer eyJhbGciOiJIUzUxMiJ9.eyJpc3MiOiI5ZDhlNWJhNy02ZDg3LTRiMWEtYjZjNi0xOWZjMDJlM2QxZWYiLCJzdWIiOiJhbm9ueW1vdXMiLCJleHAiOjE1NjMyMzUwNjZ9.OnUcjrEXUsrmFyDBpgvhzznHMFicEknSDkjCyxaugO5z992H-McRRD9bfwNl7xMI3dm2HtdAPuTu3nnFzgCLuQ',
    'Connection': 'keep-alive',
    'Referer': 'https://pub.fsa.gov.ru/ral/view/8/applicant',
    'orgId': '',
}

with requests.Session() as s:
    # First request sets the session cookies
    r = s.get('https://pub.fsa.gov.ru/ral/view/8/applicant', verify=False)
    # Then call the JSON API endpoint the page itself uses for its data
    r = s.get('https://pub.fsa.gov.ru/api/v1/ral/common/companies/8', headers=headers).json()
    print(r)
Upvotes: 1