Steve

Reputation: 4668

Extracting additional content with Python requests

I am looking to extract generated content from a web page.

I am using the requests library in Python 3 to fetch the page, as below:

 import requests 
 url = "https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2"

 html_doc = requests.get(url)
 print(html_doc.text)

The retrieved text seems to be just padding, though. What tools should I be looking at to drill into the content and extract the info there?

Upvotes: 0

Views: 62

Answers (2)

QHarr

Reputation: 84465

JavaScript needs to run on the page to produce much of the content, and requests does not execute JavaScript. A browser automation tool such as Selenium will let it run. Note that an explicit wait condition is needed to make sure the content has loaded before you read it. You can then either use Selenium's own element methods to extract the info or pass the HTML from page_source to BeautifulSoup.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

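d = webdriver.Chrome()
try:
    d.get('https://app.updateimpact.com/treeof/org.json4s/json4s-native_2.11/3.5.2')
    # Wait up to 5 seconds for the JavaScript-rendered stats list to be present in the DOM
    dependencies = WebDriverWait(d, 5).until(EC.presence_of_element_located((By.CSS_SELECTOR, '.stats-list')))
    print(dependencies)  # a WebElement; use dependencies.text for its visible text
    # Hand the rendered HTML to BeautifulSoup for further parsing
    soup = bs(d.page_source, 'lxml')
    print(soup.select_one('#tree').text) # example
finally:
    d.quit()  # close the browser even if the wait times out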
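Before starting a browser, it can be worth confirming that the HTML returned by requests really is just a shell waiting for JavaScript. A minimal sketch using only the standard library: it counts script tags and collects whatever text is visible without running any scripts. The names ShellCheck and summarize_static_html, and the sample string, are made up for illustration; in practice you would pass in the text of your requests response.

```python
from html.parser import HTMLParser


class ShellCheck(HTMLParser):
    """Count <script> tags and collect text visible without running JavaScript."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and whitespace-only runs
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def summarize_static_html(html):
    """Return (script_count, visible_text) for a raw HTML string."""
    parser = ShellCheck()
    parser.feed(html)
    parser.close()
    return parser.script_count, " ".join(parser.text_parts)


# Hypothetical sample standing in for requests.get(url).text
sample = """<html><head><title>App</title><script src="app.js"></script></head>
<body><div id="tree"></div><script>render()</script></body></html>"""

scripts, text = summarize_static_html(sample)
print(scripts, repr(text))
```

Several scripts and almost no visible text suggest the content is rendered client-side, which is the case where a browser-driven approach is needed.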

Upvotes: 1

Hugo Mota

Reputation: 11577

If the content is html, you could look into:

If it's json, you would use:
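As a rough illustration of the HTML-versus-JSON split, here is a sketch that picks a parser based on the Content-Type header. It uses only the standard library (html.parser and json) as stand-ins, which is an assumption and not necessarily the tools meant above. parse_body and TitleParser are hypothetical names, and pulling out the page title is just a placeholder for whatever you actually want to extract.

```python
import json
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the <title> element (placeholder extraction)."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_body(content_type, body):
    """Parse a response body as JSON or HTML depending on its Content-Type."""
    if "json" in content_type:
        return json.loads(body)
    if "html" in content_type:
        parser = TitleParser()
        parser.feed(body)
        parser.close()
        return {"title": parser.title}
    raise ValueError(f"unsupported content type: {content_type!r}")


# Hypothetical bodies standing in for a real response
print(parse_body("application/json; charset=utf-8", '{"name": "json4s"}'))
print(parse_body("text/html", "<html><head><title>Tree</title></head></html>"))
```

With requests you would call it as parse_body(resp.headers.get("Content-Type", ""), resp.text). For JSON responses, resp.json() does the decoding for you.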

Upvotes: 0
