I am practicing with web-scraping tools. I want to extract information about past articles in a newspaper (IDs and URLs), starting from a single URL where the procedure will be applied.
My problem is extracting the information from these articles. No matter which library I use, I cannot reach it, because there is a 'div' that does not let me go any deeper.
Each article sits in a div with class "searchRecordList Detail_search search_divider clearfix", which holds the images, URLs, and other details. All of these article divs are in turn wrapped in a div with id "divSearchResults". Nevertheless, I cannot extract it or loop over it: Python always reads it as empty, or close to it.
This is the HTML structure that contains the article information:
<div id="divSearchResults" class="searchRecordContent">
<div class="searchRecordList Detail_search search_divider clearfix">
<div class="image">
<a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link">
<img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div>
<div class="detail">
<div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&pr=10&pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div>
<h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3>
<div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div>
<div class="bottomBtn">
<a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a>
</div>
<div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div>
</div>
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
</div>
I have used both BeautifulSoup and an XPath approach, but I cannot access the article divs.
I have also tried searching for different classes inside each article (detail, result-link), but without success.
# First method
# Code
import requests
from bs4 import BeautifulSoup
url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
results = soup.find_all("div", class_="searchRecordContent")
print(results)
# Second method
# Code
from lxml import html
import requests
url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
page = requests.get(url)
tree = html.fromstring(page.content)
r = tree.xpath('//*[@id="divSearchResults"]')
print(r)
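A quick sanity check is to parse whatever requests returns and count the article rows inside the container: if the container is present but holds no rows, the articles are being injected by JavaScript after page load. Below is a sketch of that check on a minimal static string standing in for the raw response (the real check would parse response.content):

```python
from bs4 import BeautifulSoup

# What the raw (pre-JavaScript) response typically looks like:
# the container is present, but the article rows are missing
raw_html = '<div id="divSearchResults" class="searchRecordContent"></div>'

soup = BeautifulSoup(raw_html, "html.parser")
container = soup.find("div", id="divSearchResults")
rows = container.find_all("div", class_="searchRecordList") if container else []

# Container exists, yet it holds zero article rows -> content is loaded dynamically
print(container is not None, len(rows))  # True 0
```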
This is the expected result, where I can find the URLs and IDs of each article:
# Expected
<div id="divSearchResults" class="searchRecordContent">
<div class="searchRecordList Detail_search search_divider clearfix">
<div class="image">
<a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link">
<img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div>
<div class="detail">
<div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&pr=10&pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div>
<h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3>
<div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div>
<div class="bottomBtn">
<a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a>
</div>
<div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div>
</div>
</div>
....
### (the same way for the other 9 articles)
So the question is:
How can I access the 'searchRecordList Detail_search search_divider clearfix' div of each article using Python?
Upvotes: 1
Views: 576
Reputation: 84465
Content is loaded dynamically; I suspect the POST request may even be asynchronous. One approach is to use Selenium, which lets the page's JavaScript run. You then need an additional wait condition for the content to be present: I wait for the loading-spinner element (class ajax-loading-block-window) to take on the style attribute value it has once the page load is complete.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

d = webdriver.Chrome(service=Service(r'C:\Users\User\Documents\chromedriver.exe'))
d.get('https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency/')
# Wait until the loading spinner's wrapper carries the style it has after loading finishes
WebDriverWait(d, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.ajax-loading-block-window[style="height: 100%; display: none;"]')))
# Collect (id, href) for every result link on the rendered page
data = [(i.get_attribute('id'), i.get_attribute('href'))
        for i in d.find_elements(By.CSS_SELECTOR, '.result-link')]
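Once the rendered HTML is available (for example via d.page_source), the same extraction can also be done with BeautifulSoup. A sketch using a fragment of the rendered HTML shown in the question, truncated here to a single article row:

```python
from bs4 import BeautifulSoup

# Rendered HTML as it appears after JavaScript has run (one article row, truncated)
rendered = '''
<div id="divSearchResults" class="searchRecordContent">
  <div class="searchRecordList Detail_search search_divider clearfix">
    <div class="detail">
      <h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link"
             href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/">Boston Globe</a></h3>
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(rendered, "html.parser")
# Each article's ID and URL live on its a.result-link element
data = [(a.get("id"), a.get("href"))
        for a in soup.select("#divSearchResults .result-link")]
print(data)  # [('161988851', 'https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/')]
```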
Upvotes: 2