Kristada673

Reputation: 3744

How do I increase CPU utilization for processing a dataframe in a for loop?

I have a dataset of about 200,000 addresses that I want to geocode (i.e., find the latitudes and longitudes of). My (simplified) code to do this is as follows:

import pandas as pd
import numpy as np

df = pd.read_csv('dataset.csv')
Latitudes = np.zeros(len(df))
Longitudes = np.zeros(len(df))

def geocode_address(address):
    ### The logic for geocoding an address
    ### and returning its latitude and longitude
    ...

for i in range(len(df)):
    try:
        lat, lon = geocode_address(df.Address[i])
    except Exception:
        lat = lon = np.nan   # the arrays are float, so NaN works where '' would not
    Latitudes[i] = lat
    Longitudes[i] = lon

The problem is that each row (address) takes about 1-1.3 seconds to geocode, so this code would take at least a couple of days to finish for the entire dataset. I am running this in a Jupyter notebook on Windows 10. When I look at the Task Manager, I see that the jupyter.exe process is using only 0.3-0.7% of the CPU, which seems surprisingly low. Am I looking at the wrong process? If not, how do I increase the CPU utilization to at least, say, 50% for this code, so that it finishes in a few minutes or hours instead of a couple of days?

Upvotes: 0

Views: 2811

Answers (2)

Kristada673

Reputation: 3744

I solved this issue based on Bruno's advice, by partitioning the data into 10 subsets of 20k rows each and running 10 Jupyter notebooks with the same code, one per partition. This is basically "stone age parallel processing", but it did solve the issue in a simple way - the whole job finished in about 5 hours.
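For reference, the split itself only takes a couple of lines; this is a minimal sketch, and the output file name pattern is just an illustration:

import numpy as np
import pandas as pd

df = pd.read_csv('dataset.csv')

# Split the dataframe into 10 roughly equal chunks and write each to its own
# CSV, so that every notebook can work on one file independently.
for i, chunk in enumerate(np.array_split(df, 10)):
    chunk.to_csv(f'dataset_part_{i}.csv', index=False)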

The key thing to keep in mind, though, is how much of the CPU each notebook takes - in my case, about 1%. So, theoretically, I could have partitioned the data into, say, 50 parts and the whole task would have finished in about an hour. However, if each notebook were taking, say, 10% of the CPU, then I would have partitioned the data into at most 6-7 parts, as I would like to keep 30-40% of the CPU free for other apps and processes.

I would love to know if there's a way to automate this process - i.e., to find the maximum number of partitions such that, when the same notebook runs on each of them, the total CPU usage does not exceed a specified threshold - and then, of course, partition the data and run the code on each partition.
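A rough sketch of that sizing logic, assuming you already know the per-notebook CPU figure (the numbers below are illustrative assumptions, not measured values):

import numpy as np
import pandas as pd

cpu_per_worker = 1.0    # percent of total CPU one notebook was observed to use (assumption)
cpu_threshold = 60.0    # percent of total CPU you are willing to dedicate (assumption)

# Largest number of partitions that keeps total usage under the threshold
n_parts = max(1, int(cpu_threshold // cpu_per_worker))

df = pd.read_csv('dataset.csv')
partitions = np.array_split(df, n_parts)    # one chunk per worker/notebook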

Upvotes: 0

bruno desthuilliers

Reputation: 77912

You're barking up the wrong tree. Your code is not CPU-bound, it's IO-bound (there's no intensive computation going on; most of the time is spent waiting on HTTP requests).

The canonical solution to such problems is parallelization (you may want to have a look at the multiprocessing module), which is quite easy to implement here - but you'll still have to deal with your geocoding API's rate limits.
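A minimal sketch of what that could look like for the code in the question, using a thread pool instead of processes since the work is IO-bound (a multiprocessing.Pool would look almost identical). geocode_address and the Address column come from the question; the worker count is an arbitrary figure to tune against the API's rate limit:

import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

df = pd.read_csv('dataset.csv')

def safe_geocode(address):
    # Wrap the question's geocode_address so a single failed lookup
    # doesn't abort the whole run.
    try:
        return geocode_address(address)
    except Exception:
        return (np.nan, np.nan)

# 20 concurrent requests is just a starting point; adjust it to whatever
# the geocoding API's rate limit allows.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(safe_geocode, df.Address))

df['Latitude'], df['Longitude'] = zip(*results)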

Upvotes: 2
