pandas for loop, works on small dataframes hangs up on large

Question

I am using a zip code table to build a county table for a choropleth map. My code works fine on a small toy example, but it hangs when I give it the real dataframe. I can't share the dataframe, but it has roughly 70,000 zip codes stored as strings. Each value can be a valid zip code, a non-numeric string, or a numeric string that isn't a valid zip code. This is the code:

# import libraries
import numpy as np
import pandas as pd

# load customer email table
df = pd.read_csv('customer_April2018.csv', encoding='latin-1')

# load zip code table
zc = pd.read_csv('us_postal_codes.csv', encoding='latin-1')

def zip_toCounty(zip):
    for i, row in zc.iterrows():
        try:
            if int(zip) == zc['ZipCode'].loc[i] : return zc['County'].loc[i]    
        except ValueError : return 'No County'
    return 'No County'

g = lambda s : zip_toCounty(s)

df['County']=df['ZipCode'].apply(g)

df.to_csv('join_test.csv',index=False)

I don't get errors, but the estimated completion time is 40+ hours, which is completely unreasonable given the size of the data. I must be doing something wrong. How can I speed up the processing? One person suggested a hash function; is that the best option here?

Toby Petty · Accepted Answer

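You can drop the loop and the function altogether. zip_toCounty walks the zip code table with iterrows for every customer row, so the total work grows with (customer rows × zip table rows). A merge on the ZipCode column does the same lookup as a single hash-based join: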

new_df = pd.merge(df, zc, how="left", on="ZipCode")
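One caveat: the question says the customer zip codes are strings, while the original function compares int(zip) against the zip table, which suggests the table's ZipCode column is numeric. If so, the two key columns have different dtypes and need to be normalized before the lookup. Below is a minimal sketch under that assumption. The two small dataframes are made-up stand-ins for the CSV files, and add_county is a hypothetical helper name, not part of pandas. It uses Series.map against an indexed Series, which is the "hash lookup" idea from the question.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (made-up values, not the asker's data).
df = pd.DataFrame({"ZipCode": ["10001", "abc", "99999", "60601"]})
zc = pd.DataFrame({"ZipCode": [10001, 60601], "County": ["New York", "Cook"]})


def add_county(customers, zip_table):
    """Return a copy of customers with a County column looked up in zip_table."""
    out = customers.copy()
    # Non-numeric strings become NaN instead of raising,
    # which plays the role of the ValueError branch in the loop version.
    key = pd.to_numeric(out["ZipCode"], errors="coerce")
    # Keep the first County per zip code; the Series index acts as the hash table.
    lookup = zip_table.drop_duplicates("ZipCode").set_index("ZipCode")["County"]
    # Zip codes with no match (including NaN keys) fall back to 'No County'.
    out["County"] = key.map(lookup).fillna("No County")
    return out


result = add_county(df, zc)
print(result)
```

Unlike the merge, this adds only the County column and cannot change the number of rows, even if the zip table contains duplicate zip codes. With the real files, df and zc would come from pd.read_csv as in the question.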
