Sumus Vegann

Reputation: 3

Scrape issue in Beautiful Soup: 'NoneType' object has no attribute 'find_all'

When I try to execute the code below to scrape the specific websites / RSS feeds mentioned there, I keep getting:

Traceback (most recent call last):
  File "C:\Users\Jeanne\Desktop\PYPDIT\pyscape.py", line 28, in <module>
    transcripts = [url_to_transcript(u) for u in urls]
  File "C:\Users\Jeanne\Desktop\PYPDIT\pyscape.py", line 28, in <listcomp>
    transcripts = [url_to_transcript(u) for u in urls]
  File "C:\Users\Jeanne\Desktop\PYPDIT\pyscape.py", line 17, in url_to_transcript
    text = [p.text for p in soup.find(class_="itemcontent").find_all('p')]
AttributeError: 'NoneType' object has no attribute 'find_all'

Please advise.

import requests
from bs4 import BeautifulSoup
import pickle

def url_to_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="itemcontent").find_all('p')]
    print(url)
    return text

# URLs of transcripts in scope

urls = ['http://feeds.nos.nl/nosnieuwstech',
        'http://feeds.nos.nl/nosnieuwsalgemeen']

transcripts = [url_to_transcript(u) for u in urls]

Upvotes: 0

Views: 174

Answers (1)

QHarr

Reputation: 84465

The HTML returned is not the same as what you see rendered on the page. Those URLs are RSS feeds, and there is no element with class itemcontent in them, which is why soup.find(class_="itemcontent") returns None. You can use the following instead:

import requests
from bs4 import BeautifulSoup
# import pickle

urls = ['http://feeds.nos.nl/nosnieuwstech', 'http://feeds.nos.nl/nosnieuwsalgemeen']

with requests.Session() as s:
    for url in urls:
        page = s.get(url).text
        soup = BeautifulSoup(page, "lxml")
        print(url)
        # Each item's text sits inside a description element; [1:] skips the
        # first description, which belongs to the channel itself.
        print([[i.text for i in desc.select('p')] for desc in soup.select('description')[1:]])
        print('--'*100)
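If you prefer to keep your original url_to_transcript structure, a version along these lines should also work. This is only a sketch, assuming (as in the snippet above) that the paragraphs you want are inside the feed's description elements:

import requests
from bs4 import BeautifulSoup

def url_to_transcript(url):
    # The feed is XML, not the rendered article page, so there is no
    # class="itemcontent" to find; read the description elements instead.
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    descriptions = soup.select('description')[1:]  # skip the channel-level description
    text = [[p.text for p in desc.select('p')] for desc in descriptions]
    print(url)
    return text

urls = ['http://feeds.nos.nl/nosnieuwstech',
        'http://feeds.nos.nl/nosnieuwsalgemeen']
transcripts = [url_to_transcript(u) for u in urls]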

Upvotes: 0
