Reputation: 379
So I am trying to scrape headlines from here, for all 10 years. years is a list that contains:
/resources/archive/us/2007.html
/resources/archive/us/2008.html
/resources/archive/us/2009.html
/resources/archive/us/2010.html
/resources/archive/us/2011.html
/resources/archive/us/2012.html
/resources/archive/us/2013.html
/resources/archive/us/2014.html
/resources/archive/us/2015.html
/resources/archive/us/2016.html
So what my code does here is open each year page, collect all the date links, then open each one individually, take all the .text, and add each headline with its corresponding date as a row to the headlines dataframe:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

headlines = pd.DataFrame(columns=["date", "headline"])
for y in years:
    yurl = "http://www.reuters.com" + str(y)
    response = requests.get(yurl, headers=headers)
    bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')

    # Each <h5> is a month heading; the element two siblings on
    # holds that month's day links.
    days = []
    links = bs.findAll('h5')
    for mon in links:
        for day in mon.next_sibling.next_sibling:
            days.append(day)
    days = [e for e in days if str(e) not in ('\n')]  # drop bare newlines

    for ind in days:
        hlday = ind['href']
        # Pull yyyymmdd out of the link and rearrange it as mm-dd-yyyy.
        date = re.findall(r'(?!\/)[0-9].+(?=\.)', hlday)[0]
        date = date[4:6] + '-' + date[6:] + '-' + date[:4]
        print(date.split('-')[2])
        yurl = "http://www.reuters.com" + str(hlday)
        response = requests.get(yurl, headers=headers)
        if response.status_code == 404 or response.content == b'':
            print('')  # skip missing or empty day pages
        else:
            bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
            lines = bs.findAll('div', {'class': 'headlineMed'})
            for h in lines:
                headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)
It takes forever to run, so rather than looping over all the years I just ran it for the single year /resources/archive/us/2008.html. It's been 3 hours and it's still running.
Since I am new to Python, I don't understand what I am doing wrong or how I can do this better. Could it be that pandas.append is taking forever because it has to read and write a bigger dataframe each time it is run?
Upvotes: 1
Views: 95
Reputation: 249133
You are using this anti-pattern:
headlines = pd.DataFrame()
for y in years:
    for ind in days:
        headlines = headlines.append(blah)
Instead, do this:
headlines = []
for y in years:
    for ind in days:
        headlines.append(pd.DataFrame(blah))
headlines = pd.concat(headlines)
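Each call to DataFrame.append copies the whole existing frame, so the first loop does quadratic work as the frame grows; collecting the pieces in a plain list and concatenating once does all the copying a single time. Here is a minimal, self-contained sketch of the pattern, with made-up day_results standing in for your scraped pages:

import pandas as pd

# Synthetic per-day results standing in for the scraped pages;
# the real code would build these inside the request loop.
day_results = [
    [("04-01-2008", "Headline one"), ("04-01-2008", "Headline two")],
    [("04-02-2008", "Headline three")],
]

frames = []  # a plain Python list; appending to it is cheap
for day in day_results:
    frames.append(pd.DataFrame(day, columns=["date", "headline"]))

headlines = pd.concat(frames, ignore_index=True)  # one concatenation at the end
print(headlines)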
A second potential problem is that you are making 3650 web requests. If I were operating a website like that, I'd build in throttling to slow down scrapers like yours. You may find it better to collect the raw data once, store it on disk, and then process it in a second pass; that way you don't incur the cost of 3650 web requests every time you need to debug your program.
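As a rough sketch of that two-pass idea (the cache directory and the MD5-based file naming here are my own choices for illustration, not anything from your code):

import hashlib
from pathlib import Path

import requests

CACHE_DIR = Path("reuters_cache")  # assumed location; use whatever suits you
CACHE_DIR.mkdir(exist_ok=True)

def fetch_cached(url, headers=None):
    # Hash the URL so every page gets a safe, unique file name on disk.
    cache_file = CACHE_DIR / (hashlib.md5(url.encode()).hexdigest() + ".html")
    if cache_file.exists():
        return cache_file.read_bytes()  # later runs read from disk
    response = requests.get(url, headers=headers)
    cache_file.write_bytes(response.content)
    return response.content

Swapping requests.get(yurl, headers=...) for fetch_cached(yurl, headers=...) means re-runs during debugging read from disk instead of hitting the site again, and a time.sleep between first-time downloads keeps you friendlier to any throttling.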
Upvotes: 1