SamFlynn

Reputation: 379

Python - increase the speed of code using pandas.append

So I am trying to scrape headlines from here, for all ten years.

years is a list containing:

/resources/archive/us/2007.html
/resources/archive/us/2008.html
/resources/archive/us/2009.html
/resources/archive/us/2010.html
/resources/archive/us/2011.html
/resources/archive/us/2012.html
/resources/archive/us/2013.html
/resources/archive/us/2014.html
/resources/archive/us/2015.html
/resources/archive/us/2016.html

What my code does is open each year's page, collect all the date links, then open each date page individually, take all the .text, and add each headline with its corresponding date as a row to the dataframe headlines.

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

headlines = pd.DataFrame(columns=["date", "headline"])

for y in years:
    yurl = "http://www.reuters.com" + str(y)
    response = requests.get(yurl, headers=headers)
    bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')

    # Each h5 is a month header; the element two siblings over holds
    # that month's day links.
    days = []
    links = bs.findAll('h5')
    for mon in links:
        for day in mon.next_sibling.next_sibling:
            days.append(day)

    days = [e for e in days if str(e) != '\n']
    for ind in days:
        hlday = ind['href']
        # Pull the yyyymmdd part out of the link and rewrite it as mm-dd-yyyy.
        date = re.findall(r'(?!/)[0-9].+(?=\.)', hlday)[0]
        date = date[4:6] + '-' + date[6:] + '-' + date[:4]
        print(date.split('-')[2])
        yurl = "http://www.reuters.com" + str(hlday)
        response = requests.get(yurl, headers=headers)
        if response.status_code == 404 or response.content == b'':
            print('')
        else:
            bs = BeautifulSoup(response.content.decode('ascii', 'ignore'), 'lxml')
            lines = bs.findAll('div', {'class': 'headlineMed'})
            for h in lines:
                headlines = headlines.append([{"date": date, "headline": h.text}], ignore_index=True)

It takes forever to run, so rather than looping over all the years I ran it for just one year, /resources/archive/us/2008.html.

It's been 3 hours and it's still running.

Since I am new to Python, I don't understand what I am doing wrong or how I can do this better.

Could it be that pandas.append is taking forever because it has to read and write a bigger dataframe each time it's run?

Upvotes: 1

Views: 95

Answers (1)

John Zwinck

Reputation: 249133

You are using this anti-pattern:

headlines = pd.DataFrame()
for y in years:
    for ind in days:
        headlines = headlines.append(blah)

Instead, do this:

headlines = []
for y in years:
    for ind in days:
        headlines.append(pd.DataFrame(blah))

headlines = pd.concat(headlines)
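
The reason this matters: DataFrame.append copies the entire frame on every call, so building a frame row by row costs quadratic time in the number of rows, while appending to a plain Python list is cheap and you pay for the frame construction only once. Here is a minimal self-contained sketch of the pattern (scrape_day is just a hypothetical stand-in for your parsing code):

import pandas as pd

# Hypothetical stand-in for parsing one day page into (date, headline) pairs.
def scrape_day(date):
    return [(date, "headline %d for %s" % (i, date)) for i in range(3)]

rows = []  # plain list: appends are O(1) amortized
for date in ["01-02-2008", "01-03-2008"]:
    for d, text in scrape_day(date):
        rows.append({"date": d, "headline": text})

# Build the frame once at the end -- no repeated copying.
headlines = pd.DataFrame(rows, columns=["date", "headline"])
print(headlines)

Collecting plain dicts and calling pd.DataFrame once at the end works just as well as concatenating many small frames, and is often simpler.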

A second potential problem is that you are making 3650 web requests. If I were operating a website like that, I'd build in throttling to slow down scrapers like yours. You may find it better to collect the raw data once, store it on your disk, then process it in a second pass. Then you don't incur the cost of 3650 web requests every time you need to debug your program.
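
A minimal sketch of that two-pass idea, assuming one cached file per URL in a local cache/ directory (fetch_cached and the cache layout are illustrative, not part of any library):

import hashlib
import os

import requests

CACHE_DIR = "cache"
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def fetch_cached(url):
    # Return the raw page body, hitting the network only on a cache miss.
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + ".html")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    body = requests.get(url, headers=HEADERS).content
    with open(path, "wb") as f:
        f.write(body)
    return body

The first run populates the cache, and every later run while you debug the parsing reads from disk. Adding a time.sleep(1) between live requests is also a polite way to avoid being throttled.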

Upvotes: 1
