Stokolos Ilya

Reputation: 378

BeautifulSoup4 doesn't find desired elements. What is the problem?

I'm trying to write a program that extracts the links to the articles whose headlines are listed here

If you inspect the source code, you will see that each article link is contained within an h3 element. For example:

<h3 class="cd__headline" data-analytics="_list-hierarchical-xs_article_">
<a href="/2019/10/01/politics/deposition-delayed-impeachment-investigation/index.html">
<span class="cd__headline-text">State Department inspector general requests briefing on 
Ukraine with congressional staff</span><span class="cd__headline-icon cnn-icon"></span></a></h3>

I wrote some code in Python (I'm only showing the first part of the program, because this is where something goes wrong):

import requests
import bs4
res = requests.get('https://edition.cnn.com/politics')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
a0 = soup.select('h3[class="cd__headline"] > a')
a0

Output: []

What is the problem?


I've tried a different pattern:

a0 = soup.select('a > span[class="cd__headline-text"]')

Still no luck

Upvotes: 0

Views: 414

Answers (4)

chitown88

Reputation: 28565

You have 2 options:

1) As stated by others, use Selenium or some other means to render the page first; then you can extract the content from the rendered html.

2) Find the data embedded within the <script> tags, which in my experience lets me avoid Selenium most of the time. The difficult part is locating it and then manipulating the string into valid JSON that json.loads() can read.

I chose option 2:

import requests
import bs4
import json

res = requests.get('https://edition.cnn.com/politics')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# The article data is embedded in the <script> tag that defines `var CNN`
tags = soup.find_all('script')
for tag in tags:
    if 'var CNN = CNN ||' in tag.text:
        jsonStr = tag.text
        # Keep only what follows 'siblings:' ...
        jsonStr = jsonStr.split('siblings:')[-1].strip()
        # ... up to the first ']', then close the list and the object
        jsonStr = jsonStr.split(']', 1)[0] + ']}'
        jsonData = json.loads(jsonStr)

# Each article entry carries a headline and a site-relative uri
for article in jsonData['articleList']:
    headline = article['headline']
    link = 'https://edition.cnn.com' + article['uri']

    print('Headline: %s\nLink: %s\n\n' % (headline, link))

Output:

Headline: Trump ratchets up anti-impeachment rhetoric as troubles mount
Link: https://edition.cnn.com/2019/10/02/politics/president-donald-trump-impeachment-democrats-pompeo/index.html


Headline: Here's what happened in another wild day of the Trump-Ukraine scandal
Link: https://edition.cnn.com/2019/10/01/politics/ukraine-guide-rudy-giuliani-trump-whistleblower/index.html


Headline: All the President's men: Trump's allies part of a tangled web 
Link: https://edition.cnn.com/2019/10/01/politics/trump-act-alone-ukraine-call/index.html


Headline: State Department inspector general requests briefing on Ukraine with congressional staff
Link: https://edition.cnn.com/2019/10/01/politics/deposition-delayed-impeachment-investigation/index.html


Headline: Senior GOP senator rebukes Trump, says whistleblower 'ought to be heard out'
Link: https://edition.cnn.com/2019/10/01/politics/grassley-whistleblower-statement/index.html


Headline: How Lindsey Graham's support for Trump — a man he once called a 'jackass' — has evolved
Link: https://edition.cnn.com/2019/10/01/politics/lindsey-graham-defends-trump-whistleblower/index.html


Headline: Federal judge blocks California law requiring Trump to release tax returns to appear on ballot
Link: https://edition.cnn.com/2019/10/01/politics/california-law-trump-tax-returns-blocked/index.html

...
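The split-based trimming above depends on where the first ']' happens to fall. As a hedged alternative, here is a minimal sketch that lets the json module find the end of the value itself with JSONDecoder.raw_decode, and builds the link with urljoin. The sample script text (including the CNN.contentModel name) is invented for illustration and the real payload may look different; extract_json_after is a hypothetical helper, not part of any library.

```python
import json
from urllib.parse import urljoin

BASE_URL = "https://edition.cnn.com/politics"


def extract_json_after(text, marker):
    """Return the JSON value that follows the last occurrence of `marker`."""
    start = text.rindex(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so do it here.
    while start < len(text) and text[start].isspace():
        start += 1
    # Parses exactly one JSON value and ignores whatever comes after it.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Invented stand-in for the contents of the <script> tag (assumption).
sample_script = (
    'var CNN = CNN || {}; CNN.contentModel = {siblings: '
    '{"articleList": [{"headline": "Example headline", '
    '"uri": "/2019/10/01/politics/example/index.html"}]}, other: 1};'
)

data = extract_json_after(sample_script, "siblings:")
for article in data["articleList"]:
    # urljoin resolves the site-relative uri against the page URL.
    print(article["headline"], urljoin(BASE_URL, article["uri"]))
```

This only works if the text after the marker really is valid JSON (quoted keys and so on); if it is not, raw_decode raises json.JSONDecodeError and you are back to manual trimming.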




HOW DID I KNOW TO SEARCH 'var CNN = CNN ||'?

It just takes a little investigating of the html. I could simply view the source, find a headline in it, and locate its tag. What I usually do instead is write little ad-hoc scripts, which I throw away later, to narrow down the search:

1) I get every tag in the html

import requests
import bs4
import json
res = requests.get('https://edition.cnn.com/politics')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# Get every tag in html
tags = soup.find_all()

2) Go through every tag to see whether a headline appears in its text. The headlines change often, so I open the url in my browser and pick a substring from a main headline. If I go to https://edition.cnn.com/politics right now, one of the headlines reads "Kurt Volker: Diplomat never 'fully on the Trump train' set to appear as first witness in Ukraine probe". Then I check whether a substring of that is present anywhere. If it is, I can investigate further; if not, I'm out of luck and need to find another way to get the data.

for tag in tags:
    if "Kurt Volker: Diplomat never 'fully on the Trump train'" in tag.text:  
        tag_name = tag.name
        print ('Possibly found article in %s tag' %tag_name)

And the output:

Possibly found article in html tag
Possibly found article in head tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in link tag
Possibly found article in script tag

3) Aha, it is present. Given how html is structured, the html tag is the whole document and each following tag in the list is a descendant of it. My experience tells me that the leaf tag where I'll most likely find the data is the script tag, so I will now search through the script tags.

scripts = soup.find_all('script')
print (len(scripts))

4) I see there are 28 <script> tags, so which one do I want to look at?

for idx, script in enumerate(scripts):
    if "Kurt Volker: Diplomat never 'fully on the Trump train'" in script.text:  
        print ('Headline found:\nIndex position %s' %idx)

5) It says it's in index position 1. So let's grab that:

scriptStr = scripts[1].text
print (scriptStr)

6) Now I see that what I really need to search for is the <script> tag whose text starts with 'var CNN'. That marker will likely not change, while the headlines will. So I can go back and, instead of looking for the headline substring, have the code look for 'var CNN'.

...
tags = soup.find_all('script')
for tag in tags:
    if 'var CNN = CNN ||' in tag.text:
    ...
    ...

7) The last part (which I won't get into) is to trim off the excess text around the data, leaving a valid JSON string. Once you have that, you can use json.loads() to read it in and then iterate through the dictionary/list that Python stores it in.
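Going back to steps 2 and 3: the loop there reports every ancestor of the match (html, head, ...), because tag.text includes the text of all descendants. A minimal sketch of reporting only the innermost tag is below. It uses the standard library's html.parser so it runs without network access; the sample html is invented, and InnermostTagFinder and innermost_tags are hypothetical names, not BeautifulSoup APIs.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class InnermostTagFinder(HTMLParser):
    """Record the innermost open tag each time `needle` appears in a text node."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []  # currently open tags, outermost first
        self.hits = []   # innermost tag name for every match

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.needle in data and self.stack:
            self.hits.append(self.stack[-1])


def innermost_tags(html, needle):
    finder = InnermostTagFinder(needle)
    finder.feed(html)
    finder.close()
    return finder.hits


# Invented page: the headline exists only inside a script, not in the body.
sample_html = """<html><head>
<link rel="stylesheet" href="style.css">
<script>var DATA = {"headline": "Example headline about a hearing"};</script>
</head><body><div id="app"></div></body></html>"""

print(innermost_tags(sample_html, "Example headline"))
```

If the only hits are script tags, the data is embedded in JavaScript and option 2 is worth trying; if there are no hits at all, the data is probably fetched by a separate request.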

Upvotes: 1

Bastien Harkins

Reputation: 305

Based on your initial code:

import requests
import bs4
res = requests.get('https://edition.cnn.com/politics')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)

I suggest you take a look at the soup outside the browser:

with open("cnn_site.txt", "w", encoding='utf-8') as f:
    f.write(soup.prettify())

A quick analysis shows that this is not the same content as in the browser. Specifically, when you search the text file for h3, you won't find the same elements as in the browser's developer tools.

This means that when you open the site in your browser, JavaScript builds the full html, but that does not happen when you use requests.

To confirm this, I copied the body of the loaded site from my browser into a new html file.

Then:

with open("cnn_body.html") as f:
    content = f.read()
# bs4 was imported as a module above, so the class needs the bs4. prefix
soup = bs4.BeautifulSoup(content)
len(soup.find_all('h3'))
# 87

So you either need something that "triggers" the full html, which a plain request does not do, or you can parse the content that the response does contain.
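The comparison above can be reproduced offline. The sketch below counts h3 start tags in two invented snippets, one standing in for what requests receives and one for what the browser shows after scripts have run. It uses the standard library parser rather than BeautifulSoup so it runs without extra installs; TagCounter and count_tag are hypothetical helper names.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many start tags with a given name appear in a document."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tag(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    counter.close()
    return counter.count


# Invented stand-ins (assumptions): server response vs. rendered page.
server_html = ('<html><body><div id="app"></div>'
               '<script>render()</script></body></html>')
rendered_html = ('<html><body><div id="app">'
                 '<h3 class="cd__headline"><a href="/a/index.html">A</a></h3>'
                 '<h3 class="cd__headline"><a href="/b/index.html">B</a></h3>'
                 '</div></body></html>')

print(count_tag(server_html, "h3"), count_tag(rendered_html, "h3"))
```

A count of zero for the server response while the rendered page has matches is the sign that the elements are created by JavaScript.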

Upvotes: 0

Chillie

Reputation: 1485

The content on your target page is loaded dynamically with JavaScript. The initial server response (res) simply does not contain the elements you are looking for. Inspecting res.text will confirm that.

The top-voted answer to this question is here.

In a nutshell, you need to use something to execute the JavaScript that loads the content you need.

Your options are Selenium (or any other headless browser tool), Scrapy with a JS-support middleware or a derivative, requests-HTML as proposed in this answer, or any other JS-rendering library you may find.

Upvotes: 1

David Bros

Reputation: 94

It might be that you are not initializing the BeautifulSoup object like this:

soup = BeautifulSoup(res.content, 'html.parser')

Upvotes: 0
