Reputation: 3
I am trying to run this code to scrape the websites / RSS feeds listed below, but I keep getting:
Traceback (most recent call last):
  File "C:\Users\Jeanne\Desktop\PYPDIT\pyscape.py", line 28, in <module>
    transcripts = [url_to_transcript(u) for u in urls]
  File "C:\Users\Jeanne\Desktop\PYPDIT\pyscape.py", line 28, in <listcomp>
    transcripts = [url_to_transcript(u) for u in urls]
  File "C:\Users\Jeanne\Desktop\PYPDIT\pyscape.py", line 17, in url_to_transcript
    text = [p.text for p in soup.find(class_="itemcontent").find_all('p')]
AttributeError: 'NoneType' object has no attribute 'find_all'
Please advise.
import requests
from bs4 import BeautifulSoup
import pickle
def url_to_transcript(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    text = [p.text for p in soup.find(class_="itemcontent").find_all('p')]
    print(url)
    return text

urls = ['http://feeds.nos.nl/nosnieuwstech',
        'http://feeds.nos.nl/nosnieuwsalgemeen']
transcripts = [url_to_transcript(u) for u in urls]
Upvotes: 0
Views: 174
Reputation: 84465
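The response is not the HTML you see in the browser. These URLs return RSS feeds, which contain no element with class "itemcontent", so soup.find(class_="itemcontent") returns None and the chained .find_all('p') raises the AttributeError. You can use the following: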
import requests
from bs4 import BeautifulSoup
# import pickle

urls = ['http://feeds.nos.nl/nosnieuwstech', 'http://feeds.nos.nl/nosnieuwsalgemeen']

# Reuse one session for all requests
with requests.Session() as s:
    for url in urls:
        page = s.get(url).text
        soup = BeautifulSoup(page, "lxml")
        print(url)
        # [1:] skips the first <description> element; for each remaining one, collect its <p> texts
        print([[i.text for i in desc.select('p')] for desc in soup.select('description')[1:]])
        print('--'*100)
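If you want to see what the code above is working with without hitting the network, here is a minimal offline sketch of the same idea. It swaps BeautifulSoup for the standard library (xml.etree.ElementTree plus html.parser), and it iterates over the item elements instead of slicing with [1:]. The names feed_to_paragraphs, ParagraphCollector and SAMPLE_FEED are made up for illustration, and the sample feed is invented; it only assumes that each item's description holds HTML with p tags, as the selector above implies.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def feed_to_paragraphs(xml_text):
    """Return one list of paragraph strings per <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    result = []
    for item in root.iter("item"):
        # findtext returns the default instead of None when the tag is missing
        description = item.findtext("description", default="")
        collector = ParagraphCollector()
        collector.feed(description)
        collector.close()
        result.append(collector.paragraphs)
    return result


# Invented sample feed for illustration only (not real NOS content).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <description>Channel description, not an article</description>
    <item>
      <title>First item</title>
      <description><![CDATA[<p>First paragraph.</p><p>Second paragraph.</p>]]></description>
    </item>
    <item>
      <title>Second item</title>
      <description><![CDATA[<p>Only paragraph.</p>]]></description>
    </item>
  </channel>
</rss>"""

print(feed_to_paragraphs(SAMPLE_FEED))
```

Because it only walks item elements, the channel-level description is never picked up, and a missing description yields an empty list rather than an AttributeError. For a live feed you would pass in the body of the response, e.g. requests.get(url).content, assuming the server returns well-formed XML.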
Upvotes: 0