Reputation: 95
I'm trying to write a scraper in Python using urllib and Beautiful Soup. I have a CSV of URLs for news stories, and the scraper works for about 80% of the pages, but when there is a picture at the top of the story the script no longer pulls the time or the body text. I am mostly confused because soup.find and soup.find_all don't seem to produce different results. I have tried a variety of tags that should capture the text, as well as both the 'lxml' and 'html.parser' parsers.
Here is the code:
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

testcount = 0
titles1 = []
bodies1 = []
times1 = []
data = pd.read_csv('URLsALLjun27.csv', header=None)
for url in data[0]:
    try:
        html = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(html, "lxml")
        titlemess = soup.find(id="title").get_text()  # getting the title
        titlestring = str(titlemess)  # make it a string
        title = titlestring.replace("\n", "").replace("\r", "")
        titles1.append(title)
        bodymess = soup.find(class_="article").get_text()  # get the body text
        bodystring = str(bodymess)  # make body a string
        body = bodystring.replace("\n", "").replace("\u3000", "")  # scrub newlines and full-width spaces
        bodies1.append(body)  # add to list for export
        timemess = soup.find('span', {"class": "time"}).get_text()
        timestring = str(timemess)
        time = timestring.replace("\n", "").replace("\r", "").replace("年", "-").replace("月", "-").replace("日", "")
        times1.append(time)
        testcount = testcount + 1  # counter
        print(testcount)
    except Exception as e:
        print(testcount, e)
And here are some of the results I get (those marked 'NoneType' are the ones where the title was pulled successfully but the body/time is empty):
1 http://news.xinhuanet.com/politics/2016-06/27/c_1119122255.htm
2 http://news.xinhuanet.com/politics/2016-05/22/c_129004569.htm 'NoneType' object has no attribute 'get_text'
Any help would be much appreciated! Thanks.
EDIT: I don't have '10 reputation points' so I can't post more links to test but will comment with them if you need more examples of pages.
Upvotes: 1
Views: 2411
Reputation: 10223
The issue is that the page with the picture in it has no element with class="article", and the same goes for "class":"time". Consequently, it seems that you'll have to detect whether the page has a picture and, if it does, search for the date and text as follows:
For the date, try:
timemess = soup.find(id="pubtime").get_text()
For the body text, it seems that the article is really just the caption for the picture, so you could try the following:
bodymess = soup.find('img').findNext().get_text()
In brief, soup.find('img') finds the image and findNext() goes to the next element, which happens to contain the text.
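To make the behaviour of that call concrete, here is a minimal sketch on a made-up HTML fragment. The markup below is an assumption for illustration only, not the actual page. It uses find_next(), the snake_case name that Beautiful Soup 4 uses for findNext().

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: an image followed by its caption paragraph.
html = '<div id="content"><img src="photo.jpg"><p>Caption for the photo.</p></div>'
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img")
# With no arguments, find_next() returns the first tag after this one in document order.
caption_tag = img.find_next()
caption = caption_tag.get_text()
print(caption_tag.name, caption)
```

Note that this relies on the caption being the very next tag after the first image on the page, so it is worth checking on a few real pages before trusting it.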
Thus, in your code, I would do something like the following:
try:
    bodymess = soup.find(class_="article").get_text()
except AttributeError:
    bodymess = soup.find('img').findNext().get_text()
try:
    timemess = soup.find('span',{"class":"time"}).get_text()
except AttributeError:
    timemess = soup.find(id="pubtime").get_text()
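The same fallback can also be written with explicit None checks instead of catching AttributeError, which avoids a second crash when neither layout matches. This is only a sketch: the helper names text_or_none and extract_body_and_time are hypothetical, and the two sample pages are invented stand-ins for the text-only and picture layouts.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_body_and_time(soup):
    """Try the text-only layout first, then fall back to the picture layout."""
    body_tag = soup.find(class_="article")
    if body_tag is None:
        img = soup.find("img")
        # On picture pages the caption is assumed to be the tag right after the image.
        body_tag = img.find_next() if img is not None else None

    time_tag = soup.find("span", class_="time")
    if time_tag is None:
        time_tag = soup.find(id="pubtime")

    return text_or_none(body_tag), text_or_none(time_tag)


# Invented sample pages standing in for the two layouts.
text_page = '<div class="article"><p>Body text.</p></div><span class="time">2016-06-27</span>'
photo_page = '<div><img src="a.jpg"><p>Caption text.</p></div><span id="pubtime">2016-05-22</span>'

text_result = extract_body_and_time(BeautifulSoup(text_page, "html.parser"))
photo_result = extract_body_and_time(BeautifulSoup(photo_page, "html.parser"))
empty_result = extract_body_and_time(BeautifulSoup("<div></div>", "html.parser"))
print(text_result, photo_result, empty_result)
```

Returning None for a missing field lets the calling loop append a placeholder, so the title, body, and time lists stay the same length.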
As a general workflow in web scraping, I usually open the page in a browser first and locate the elements I need in the page's HTML source before writing any code.
Upvotes: 1