Reputation: 365
I am using a zip code table to make a county table for a choroplot. My code works fine when I give it a small toy example, but when I give it the dataframe it hangs up. I can't share the dataframe but it's +/- 70,000 zip codes, encoded as strings and the values can be correct zipcodes, non numerical string, or numerical string that isn't a correct zipcode, each in string format. This is the code:
# import libraries
import numpy as np
import pandas as pd
# load customer email table
df = pd.read_csv('customer_April2018.csv', encoding='latin-1')
# load zip code table
zc = pd.read_csv('us_postal_codes.csv', encoding='latin-1')
def zip_toCounty(zip):
for i, row in zc.iterrows():
try:
if int(zip) == zc['ZipCode'].loc[i] : return zc['County'].loc[i]
except ValueError : return 'No County'
return 'No County'
g = lambda s : zip_toCounty(s)
df['County']=df['ZipCode'].apply(g)
df.to_csv('join_test.csv',index=False)
I don't get errors but the estimated completion time is 40+ hours which is completely unreasonable given the size of the data. I must be using something wrong, how can I speed up the processing? One person suggested a hash function, is that the best option here?
Upvotes: 2
Views: 266
Reputation: 4660
You could do without the loop and function altogether and just merge the datasets on the ZipCode column:
new_df = pd.merge(df, zc, how="left", left_on="ZipCode", right_on="ZipCode")
Upvotes: 1