Reputation: 51

How to speed up web scraping in python

I'm working on a project for school and I am trying to get data about movies. I've managed to write a script to get the data I need from IMDbPY and Open Movie DB API (omdbapi.com). The challenge I'm experiencing is that I'm trying to get data for 22,305 movies and each request takes about 0.7 seconds. Essentially my current script will take about 8 hours to complete. Looking for any way to maybe use multiple requests at the same time or any other suggestions to significantly speed up the process of getting this data.

import urllib2
import json
import pandas as pd
import time
import imdb

start_time = time.time() #record time at beginning of script

#used to make imdb.com think we are getting this data from a browser
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }

#Open Movie Database Query url for IMDb IDs
url = 'http://www.omdbapi.com/?tomatoes=true&i='

#read the ids from the imdb_id csv file
imdb_ids = pd.read_csv('ids.csv')

cols = [u'Plot', u'Rated', u'tomatoImage', u'Title', u'DVD', u'tomatoMeter',
 u'Writer', u'tomatoUserRating', u'Production', u'Actors', u'tomatoFresh',
 u'Type', u'imdbVotes', u'Website', u'tomatoConsensus', u'Poster', u'tomatoRotten',
 u'Director', u'Released', u'tomatoUserReviews', u'Awards', u'Genre', u'tomatoUserMeter',
 u'imdbRating', u'Language', u'Country', u'imdbpy_budget', u'BoxOffice', u'Runtime',
 u'tomatoReviews', u'imdbID', u'Metascore', u'Response', u'tomatoRating', u'Year',
 u'imdbpy_gross']

#create movies dataframe
movies = pd.DataFrame(columns=cols)

i=0
for i in range(len(imdb_ids)):

    start = time.time()
    req = urllib2.Request(url + str(imdb_ids.ix[i,0]), None, headers) #request page
    response = urllib2.urlopen(req) #actually call the html request
    the_page = response.read() #read the json from the omdbapi query
    movie_json = json.loads(the_page) #convert the json to a dict

    #get the gross revenue and budget from IMDbPy
    data = imdb.IMDb()
    movie_id = imdb_ids.ix[i,['imdb_id']]
    movie_id = movie_id.to_string()
    movie_id = int(movie_id[-7:])
    data = data.get_movie_business(movie_id)
    data = data['data']
    data = data['business']

    #get the budget $ amount out of the budget IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0]) 
    except:
        None

    #get the gross $ amount out of the gross IMDbPy string
    try:
        budget = data['budget']
        budget = budget[0]
        budget = budget.replace('$', '')
        budget = budget.replace(',', '')
        budget = budget.split(' ')
        budget = str(budget[0])

        #get the gross $ amount out of the gross IMDbPy string
        gross = data['gross']
        gross = gross[0]
        gross = gross.replace('$', '')
        gross = gross.replace(',', '')
        gross = gross.split(' ')
        gross = str(gross[0])
    except:
        None

    #add gross to the movies dict 
    try:
        movie_json[u'imdbpy_gross'] = gross
    except:
        movie_json[u'imdbpy_gross'] = 0

    #add budget to the movies dict
    try:
        movie_json[u'imdbpy_budget'] = budget
    except:
        movie_json[u'imdbpy_budget'] = 0

    #create new dataframe that can be merged to movies DF    
    tempDF = pd.DataFrame.from_dict(movie_json, orient='index')
    tempDF = tempDF.T

    #add the new movie to the movies dataframe
    movies = movies.append(tempDF, ignore_index=True)
    end = time.time()
    time_took = round(end-start, 2)
    percentage = round(((i+1) / float(len(imdb_ids))) * 100,1)
    print i+1,"of",len(imdb_ids),"(" + str(percentage)+'%)','completed',time_took,'sec'
    #increment counter
    i+=1  

#save the dataframe to a csv file            
movies.to_csv('movie_data.csv', index=False)
end_time = time.time()
print round((end_time-start_time)/60,1), "min"

Upvotes: 5

Answers (2)

Granitosaurus

Reputation: 21436

When web-scraping we generally have two types of bottlenecks:

IO blocks - whenever we make a request, we need to wait for the server to respond, which can block our entire program.
CPU blocks - when parsing web scraped content, our code might be limited by CPU processing power.

CPU Speed

CPU blocks are an easy fix - we can spawn more processes. Generally, 1 CPU core can efficiently handle 1 process. So if our scraper is running on a machine that has 12 CPU cores we can spawn 12 processes for up to a 12x speed boost:

from concurrent.futures import ProcessPoolExecutor

def parse(html):
    ...  # CPU intensive parsing
    
htmls = [...]
with ProcessPoolExecutor() as executor:
    for result in executor.map(parse, htmls):
        print(result)

Python's ProcessPoolExecutor spawns a pool of worker processes (by default as many as the machine has CPU cores) and distributes the tasks among them.

IO Speed

For IO-blocking we have more options as our goal is to get rid of useless waiting which can be done through threads, processes and asyncio loops.

If we're making thousands of requests we can't spawn hundreds of processes. Threads will be less expensive but still, there's a better option - asyncio loops.

Asyncio loops can execute tasks in no specific order. In other words, while task A is being blocked task B can take over the program. This is perfect for web scraping as there's very little overhead computing going on. We can scale to thousands of requests in a single program.
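The switching described above can be seen without any network access. The following is a minimal sketch, not a scraper: asyncio.sleep stands in for waiting on a server, and the names fake_request and run_all are made up for this illustration. Three simulated requests that would take about 0.6 seconds back to back finish in roughly the time of the slowest one.

```python
import asyncio
import time


async def fake_request(name, delay):
    # asyncio.sleep stands in for waiting on a server; while this task
    # waits, the event loop is free to run the other tasks.
    await asyncio.sleep(delay)
    return name


async def run_all():
    # Start three simulated requests at once and wait for all of them.
    # gather() returns the results in the order the tasks were passed in.
    return await asyncio.gather(
        fake_request("A", 0.3),
        fake_request("B", 0.2),
        fake_request("C", 0.1),
    )


start = time.perf_counter()
results = asyncio.run(run_all())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f} seconds")
```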

Unfortunately, for asyncio to work, we need to use Python packages that support asyncio. For example, by using httpx and asyncio we can speed up our scraping significantly:

# comparing synchronous `requests`:
import requests
from time import time

_start = time()
for i in range(50):
    requests.get("http://httpbin.org/delay/1")
print(f"finished in: {time() - _start:.2f} seconds")
# finished in: 52.21 seconds

# versus asynchronous `httpx`
import httpx
import asyncio
from time import time

_start = time()

async def main():
    async with httpx.AsyncClient() as client:
        tasks = [client.get("http://httpbin.org/delay/1") for i in range(50)]
        for response_future in asyncio.as_completed(tasks):
            response = await response_future
    print(f"finished in: {time() - _start:.2f} seconds")

asyncio.run(main())
# finished in: 3.55 seconds
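For comparison, the thread-based option mentioned earlier needs only the standard library and works with blocking code. This is a rough sketch under stated assumptions: fake_fetch and the movie ids are invented for illustration, and time.sleep stands in for a blocking HTTP call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(movie_id):
    # time.sleep stands in for a blocking HTTP call (hypothetical fetch).
    time.sleep(0.1)
    return {"imdbID": movie_id}


movie_ids = [f"tt{n:07d}" for n in range(20)]  # made-up ids for illustration

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() runs the calls concurrently but yields results in input order.
    records = list(pool.map(fake_fetch, movie_ids))
elapsed = time.perf_counter() - start
print(len(records), f"{elapsed:.2f} seconds")
```

With 10 workers the 20 simulated calls overlap, so the total should be well below the 2 seconds they would need one after another.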

Combining Both

With async code we can avoid IO-blocks and with processes we can scale up CPU intensive parsing - a perfect combo to optimize web scraping:

import asyncio
import multiprocessing
from concurrent.futures import ProcessPoolExecutor
from time import sleep, time

import httpx


async def scrape(urls):
    """this is our async scraper that scrapes"""
    results = []
    async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
        scrape_tasks = [client.get(url) for url in urls]
        for response_f in asyncio.as_completed(scrape_tasks):
            response = await response_f
            # emulate data parsing/calculation
            sleep(0.5)
            ...
            results.append("done")
    return results


def scrape_wrapper(args):
    i, urls = args
    print(f"subprocess {i} started")
    result = asyncio.run(scrape(urls))
    print(f"subprocess {i} ended")
    return result


def multi_process(urls):
    _start = time()

    batches = []
    workers = max(multiprocessing.cpu_count() - 1, 1)  # let's keep 1 core for ourselves
    batch_size = max(-(-len(urls) // workers), 1)  # urls per process (ceiling division)
    print(f"scraping {len(urls)} urls through {workers} processes")
    for i in range(0, len(urls), batch_size):
        batches.append(urls[i : i + batch_size])
    with ProcessPoolExecutor(max_workers=workers) as executor:
        for result in executor.map(scrape_wrapper, enumerate(batches)):
            print(result)
        print("done")

    print(f"multi-process finished in {time() - _start:.2f}")

def single_process(urls):
    _start = time()
    results = asyncio.run(scrape(urls))
    print(f"single-process finished in {time() - _start:.2f}")



if __name__ == "__main__":
    urls = ["http://httpbin.org/delay/1" for i in range(100)]
    multi_process(urls)
    # multi-process finished in 7.22
    single_process(urls)
    # single-process finished in 51.28

These foundational concepts sound complex, but once you narrow things down to the root of the issue, the fixes are very straightforward and already present in Python!

For more details on this subject see my blog Web Scraping Speed: Processes, Threads and Async

Upvotes: 2

Jan Vlcinsky

Reputation: 44112

Use the Eventlet library to fetch concurrently

As advised in the comments, you should fetch your feeds concurrently. This can be done using threading, multiprocessing, or eventlet.

Install eventlet

$ pip install eventlet

Try web crawler sample from `eventlet`

See: http://eventlet.net/doc/examples.html#web-crawler

Understanding concurrency with `eventlet`

With threading, the system takes care of switching between your threads. This becomes a big problem when you have to access shared data structures, as you never know which other thread is currently accessing your data. You then start playing with synchronized blocks, locks and semaphores just to synchronize access to your shared data structures.

With eventlet it gets much simpler: you always run only one real thread and switch between "green" threads only at I/O instructions or at other eventlet calls. The rest of your code runs uninterrupted, without the risk that another thread would mess with your data.

You only have to take care of following:

all I/O operations must be non-blocking (this is mostly easy, eventlet provides non-blocking versions for most of the I/O you need).
your remaining code must not be CPU expensive as it would block switching between "green" threads for longer time and the power of "green" multithreading would be gone.

A great advantage of eventlet is that it lets you write code in a straightforward way without spoiling it (too) much with Locks, Semaphores etc.

Apply `eventlet` to your code

If I understand correctly, the list of urls to fetch is known in advance and the order in which they are processed does not matter for your analysis. This should allow an almost direct copy of the example from eventlet. I see that the index i has some significance, so you might consider combining the url and the index into a tuple and processing these tuples as independent jobs.

There are definitely other methods, but personally I have found eventlet really easy to use compared to other techniques, while getting really good results (especially with fetching feeds). You just have to grasp the main concepts and be a bit careful to follow eventlet's requirements (stay non-blocking).

Fetching urls using requests and eventlet - erequests

There are various packages for asynchronous processing with requests. One of them uses eventlet and is named erequests, see https://github.com/saghul/erequests

Simple sample fetching set of urls

import erequests

# have list of urls to fetch
urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]
# erequests.async.get(url) creates asynchronous request
async_reqs = [erequests.async.get(url) for url in urls]
# each async request is ready to go, but not yet performed

# erequests.map will call each async request to the action
# what returns processed request `req`
for req in erequests.map(async_reqs):
    if req.ok:
        content = req.content
        # process it here
        print "processing data from:", req.url

Problems for processing this specific question

We are able to fetch and somehow process all the urls we need. But in this question, processing is bound to a particular record in the source data, so we need to match each processed request with the index of the record it belongs to in order to get further details for final processing.

As we will see later, asynchronous processing does not honour the order of requests: some are processed sooner and some later, and map yields whatever is completed.

One option is to attach index of given url to the requests and use it later when processing returned data.
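The idea of carrying the index along with each job can be sketched independently of erequests. The example below swaps in the standard-library thread pool from Python 3 instead of eventlet, and fake_fetch with its example.com urls is a made-up stand-in for a real request. Jobs complete in unpredictable order, yet each result lands in the slot of the record it belongs to.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_fetch(url):
    # Random delay so that responses come back in unpredictable order.
    time.sleep(random.uniform(0.01, 0.05))
    return "body of " + url


urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
results = [None] * len(urls)

with ThreadPoolExecutor(max_workers=3) as pool:
    # Remember which index each submitted job belongs to.
    index_of = {pool.submit(fake_fetch, url): i for i, url in enumerate(urls)}
    for future in as_completed(index_of):
        # Completion order is arbitrary; the stored index restores the mapping.
        results[index_of[future]] = future.result()

print(results)
```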

Complex sample of fetching and processing urls with preserving url indices

Note: following sample is rather complex, if you can live with solution provided above, skip this. But make sure you are not running into problems detected and resolved below (urls being modified, requests following redirects).

import erequests
from itertools import count, izip
from functools import partial

urls = [
    'http://www.heroku.com',
    'http://python-tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

def print_url_index(index, req, *args, **kwargs):
    content_length = req.headers.get("content-length", None)
    todo = "PROCESS" if req.status_code == 200 else "WAIT, NOT YET READY"
    print "{todo}: index: {index}: status: {req.status_code}: length: {content_length}, {req.url}".format(**locals())

async_reqs = (erequests.async.get(url, hooks={"response": partial(print_url_index, i)}) for i, url in izip(count(), urls))

for req in erequests.map(async_reqs):
    pass

Attaching hooks to request

requests (and erequests too) allows defining hooks for an event called response. Each time the request gets a response, this hook function is called and can do something or even modify the response.

Following line defines some hook to response:

erequests.async.get(url, hooks={"response": partial(print_url_index, i)})

Passing url index to the hook function

The signature of any hook should be func(req, *args, **kwargs)

But we need to pass into the hook function also the index of url we are processing.

For this purpose we use functools.partial which allows creation of simplified functions by fixing some of parameters to specific value. This is exactly what we need, if you see print_url_index signature, we need just to fix value of index, the rest will fit requirements for hook function.

In our call we use partial with name of simplified function print_url_index and providing for each url unique index of it.
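What partial does here can be tried in isolation. This is a small sketch in which a plain string stands in for the response object, and the hook returns its message instead of printing it so that the result is easy to inspect. Both changes are assumptions made for the illustration.

```python
from functools import partial


def print_url_index(index, req, *args, **kwargs):
    # Same parameter order as the hook above: index first, response second.
    return "index {}: {}".format(index, req)


# Fix the first positional argument; the result takes only (req, *args, **kwargs).
hook_for_3 = partial(print_url_index, 3)

# "fake response" is a placeholder string standing in for a response object.
message = hook_for_3("fake response")
print(message)
```

The object returned by partial remembers the fixed value in its args attribute, so each url can get its own hook carrying its own index.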

The index could be provided in the loop by enumerate. In case of a larger number of parameters we may work in a more memory-efficient way and use count, which generates an incremented number each time, starting from 0 by default.

Let us run it:

$ python ereq.py
WAIT, NOT YET READY: index: 3: status: 301: length: 66, http://python-requests.org/
WAIT, NOT YET READY: index: 4: status: 301: length: 58, http://kennethreitz.com/
WAIT, NOT YET READY: index: 0: status: 301: length: None, http://www.heroku.com/
PROCESS: index: 2: status: 200: length: 7700, http://httpbin.org/
WAIT, NOT YET READY: index: 1: status: 301: length: 64, http://python-tablib.org/
WAIT, NOT YET READY: index: 4: status: 301: length: None, http://kennethreitz.org
WAIT, NOT YET READY: index: 3: status: 302: length: 0, http://docs.python-requests.org
WAIT, NOT YET READY: index: 1: status: 302: length: 0, http://docs.python-tablib.org
PROCESS: index: 3: status: 200: length: None, http://docs.python-requests.org/en/latest/
PROCESS: index: 1: status: 200: length: None, http://docs.python-tablib.org/en/latest/
PROCESS: index: 0: status: 200: length: 12064, https://www.heroku.com/
PROCESS: index: 4: status: 200: length: 10478, http://www.kennethreitz.org/

This shows, that:

requests are not processed in the order they were generated
some requests follow redirection, so hook function is called multiple times
carefully inspecting url values we can see, that no url from original list urls is reported by response, even for index 2 we got extra / appended. That is why simple lookup of response url in original list of urls would not help us.

Upvotes: 9