I am practicing with web-scraping tools. I want to extract information about past articles in a newspaper (IDs and URLs), starting from a single URL where the procedure will be applied.
My problem is extracting the information from these articles. No matter which library I use, I cannot reach it, because there is a 'div' that does not let me go any deeper.
Each article sits in a div with class "searchRecordList Detail_search search_divider clearfix", which holds the images, URLs, and other details. All of these article divs are in turn wrapped in a div with id "divSearchResults". Nevertheless, I cannot extract it or loop over it: Python always reads it as empty, or close to it.
This is the HTML structure that contains the article information:
<div id="divSearchResults" class="searchRecordContent">
<div class="searchRecordList Detail_search search_divider clearfix">
<div class="image">
<a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link">
<img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div>
<div class="detail">
<div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&pr=10&pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div>
<h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3>
<div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div>
<div class="bottomBtn">
<a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a>
</div>
<div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div>
</div>
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
<div class="searchRecordList Detail_search search_divider clearfix">
</div>
</div>
I have used both BeautifulSoup and an XPath approach, but I cannot access the article divs.
I have also tried searching for different classes inside each article (detail, result-link), but without success.
# First method
# Code
import requests
from bs4 import BeautifulSoup
url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
results = soup.find_all("div", class_="searchRecordContent")
print(results)
# Second method
# Code
from lxml import html
import requests
url = 'https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency'
page = requests.get(url)
tree = html.fromstring(page.content)
r = tree.xpath('//*[@id="divSearchResults"]')
print(r)
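A quick sanity check is to parse whatever requests returns and count the article rows inside the container: if the container is present but holds no rows, the articles are being injected by JavaScript after page load. Below is a sketch of that check on a minimal static string standing in for the raw response (the real check would parse response.content):

```python
from bs4 import BeautifulSoup

# What the raw (pre-JavaScript) response typically looks like:
# the container is present, but the article rows are missing
raw_html = '<div id="divSearchResults" class="searchRecordContent"></div>'

soup = BeautifulSoup(raw_html, "html.parser")
container = soup.find("div", id="divSearchResults")
rows = container.find_all("div", class_="searchRecordList") if container else []

# Container exists, yet it holds zero article rows -> content is loaded dynamically
print(container is not None, len(rows))  # True 0
```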
This is the expected result, where I can find the URLs and IDs of each article:
# Expected
<div id="divSearchResults" class="searchRecordContent">
<div class="searchRecordList Detail_search search_divider clearfix">
<div class="image">
<a style="display: block;" pubid="19789" pubtitle="Boston Globe" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" id="a_img_161988851" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" class="srcimg-link">
<img src="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" data-original="https://newspaperarchive.com/us/massachusetts/boston/boston-globe/1922/07-16/161988851-thumbnail.jpg" width="180" height="180" alt="Boston Globe" class="srcimg lazy" style="display: inline;"></a></div>
<div class="detail">
<div class="pull-right flagIcon unitedstatesofamerica"><a aria-label="United States Of America" aria-valuetext="United States Of America" href="https://newspaperarchive.com/tags/?pep=dependency&pr=10&pci=7/" class="tooltipElement" rel="tooltip" data-original-title="Narrow results to this country only?"><svg aria-hidden="true" width="32px" height="32px" class="flagborder"><use xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="/Content/assets/images/flag-icon.svg#unitedstatesofamerica"></use></svg></a></div>
<h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link" rel="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97" href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/" target="_blank">Boston Globe</a><span tabindex="0">Sunday, July 16, 1922, Boston, Massachusetts, United States Of America</span></h3>
<div tabindex="0" class="text"><b>dependency</b> within fivo years Of the death of such a vet right whatever unless they make claim FIBRE TUXEDO EXT...Boston Globe (Newspaper) - July 16, 1922, Boston, Massachusetts</div>
<div class="bottomBtn">
<a class="btn btn-gradgrey" style="" id="ahref_161988851" href="javascript:void(0);" onclick="javascript:UpgradePopup();">Save to Treasure Box</a> <a class="btn btn-gradgrey" onclick="javascript:UpgradePopup();" href="javascript:void(0)">Don't Show Me Again</a>
</div>
<div tabindex="0" class="dateaddedgrey"> Date Added May 31, 2010</div>
</div>
</div>
....
### (the same way for the other 9 articles)
So the question is:
How can I access the 'searchRecordList Detail_search search_divider clearfix' div of each article using Python?
Upvotes: 1
Views: 576
Reputation: 84465
Content is loaded dynamically; I suspect the POST request may even be asynchronous. One approach is to use Selenium, which lets the page's JavaScript run. You then need an additional wait condition for the content to be present: I wait for the loading-spinner element (class ajax-loading-block-window) to take on the style attribute value it has once the page load is complete.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

d = webdriver.Chrome(service=Service(r'C:\Users\User\Documents\chromedriver.exe'))
d.get('https://newspaperarchive.com/tags/?pc=3091&psi=50&pci=7&pt=19789&ndt=bd&pd=1&pm=1&py=1920&pe=31&pem=12&pey=1929&pep=dependency/')
# Wait until the loading spinner's wrapper carries the style it has after loading finishes
WebDriverWait(d, 10).until(EC.presence_of_element_located(
    (By.CSS_SELECTOR, '.ajax-loading-block-window[style="height: 100%; display: none;"]')))
# Collect (id, href) for every result link on the rendered page
data = [(i.get_attribute('id'), i.get_attribute('href'))
        for i in d.find_elements(By.CSS_SELECTOR, '.result-link')]
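Once the rendered HTML is available (for example via d.page_source), the same extraction can also be done with BeautifulSoup. A sketch using a fragment of the rendered HTML shown in the question, truncated here to a single article row:

```python
from bs4 import BeautifulSoup

# Rendered HTML as it appears after JavaScript has run (one article row, truncated)
rendered = '''
<div id="divSearchResults" class="searchRecordContent">
  <div class="searchRecordList Detail_search search_divider clearfix">
    <div class="detail">
      <h3><a pubid="19789" pubtitle="Boston Globe" id="161988851" class="result-link"
             href="https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/">Boston Globe</a></h3>
    </div>
  </div>
</div>
'''

soup = BeautifulSoup(rendered, "html.parser")
# Each article's ID and URL live on its a.result-link element
data = [(a.get("id"), a.get("href"))
        for a in soup.select("#divSearchResults .result-link")]
print(data)  # [('161988851', 'https://newspaperarchive.com/boston-globe-jul-16-1922-p-97/')]
```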
Upvotes: 2