Kristada673

Reputation: 3744

How do I increase CPU utilization for processing a dataframe in a for loop?

I have a dataset of about 200,000 addresses that I want to geocode (i.e., find the latitudes and longitudes of). My (simplified) code to do this is as follows:

import pandas as pd
import numpy as np

df = pd.read_csv('dataset.csv')
Latitudes = np.zeros(len(df))
Longitudes = np.zeros(len(df))

def geocode_address(address):
    ### The logic for geocoding an address
    ### and returning its latitude and longitude
    ...

for i in range(len(df)):
    try:
        lat, lon = geocode_address(df.Address[i])
    except Exception:
        lat = lon = np.nan   # the arrays are float, so NaN works where '' would not
    Latitudes[i] = lat
    Longitudes[i] = lon

The problem is that each row (address) takes about 1-1.3 seconds to geocode, so this code would take at least a couple of days to finish for the entire dataset. I am running this in a Jupyter notebook on Windows 10. When I look at the Task Manager, I see that the jupyter.exe process is using only 0.3-0.7% of the CPU, which seems surprisingly low. Am I looking at the wrong process? If not, how do I increase the CPU utilization to at least, say, 50% for this code, so that it finishes in a few minutes or hours instead of a couple of days?

Upvotes: 0

Views: 2811

Answers (2)

Kristada673

Reputation: 3744

I solved this issue based on Bruno's advice, by partitioning the data into 10 subsets of 20k rows each and running 10 Jupyter notebooks with the same code, one per partition. This is basically "stone age parallel processing", but it did solve the issue in a simple way - the whole job finished in about 5 hours.
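For reference, the split itself only takes a couple of lines; this is a minimal sketch, and the output file name pattern is just an illustration:

import numpy as np
import pandas as pd

df = pd.read_csv('dataset.csv')

# Split the dataframe into 10 roughly equal chunks and write each to its own
# CSV, so that every notebook can work on one file independently.
for i, chunk in enumerate(np.array_split(df, 10)):
    chunk.to_csv(f'dataset_part_{i}.csv', index=False)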

The key thing to keep in mind, though, is how much of the CPU each notebook takes - in my case, about 1%. So, theoretically, I could have partitioned the data into, say, 50 parts and the whole task would have finished in about an hour. However, if each notebook were taking, say, 10% of the CPU, then I would have partitioned the data into at most 6-7 parts, as I would like to keep 30-40% of the CPU free for other apps and processes.

I would love to know if there's a way to automate this process - i.e., to find the maximum number of partitions such that, when the same notebook runs on each of them, the total CPU usage does not exceed a specified threshold - and then, of course, partition the data and run the code on each partition.
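A rough sketch of that sizing logic, assuming you already know the per-notebook CPU figure (the numbers below are illustrative assumptions, not measured values):

import numpy as np
import pandas as pd

cpu_per_worker = 1.0    # percent of total CPU one notebook was observed to use (assumption)
cpu_threshold = 60.0    # percent of total CPU you are willing to dedicate (assumption)

# Largest number of partitions that keeps total usage under the threshold
n_parts = max(1, int(cpu_threshold // cpu_per_worker))

df = pd.read_csv('dataset.csv')
partitions = np.array_split(df, n_parts)    # one chunk per worker/notebook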

Upvotes: 0

bruno desthuilliers

Reputation: 77912

You're barking up the wrong tree. Your code is not CPU-bound, it's IO-bound (there's no intensive computation going on; most of the time is spent waiting on HTTP requests).

The canonical solution to such problems is parallelization (you may want to have a look at the multiprocessing module), which is quite easy to implement here - but you'll still have to deal with your geocoding API's rate limits.
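A minimal sketch of what that could look like for the code in the question, using a thread pool instead of processes since the work is IO-bound (a multiprocessing.Pool would look almost identical). geocode_address and the Address column come from the question; the worker count is an arbitrary figure to tune against the API's rate limit:

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

df = pd.read_csv('dataset.csv')

def safe_geocode(address):
    # Wrap the question's geocode_address so a single failed lookup
    # doesn't abort the whole run.
    try:
        return geocode_address(address)
    except Exception:
        return (np.nan, np.nan)

# 20 concurrent requests is just a starting point; adjust it to whatever
# the geocoding API's rate limit allows.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(safe_geocode, df.Address))

df['Latitude'], df['Longitude'] = zip(*results)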

Upvotes: 2
