Reputation: 27
I have three csv files we can call a, b, and c. File a has geographic information including zip codes. File b has statistical data. File c has only zip codes.
I used pandas to read a and b into dataframes (a_df and b_df), which I then joined on the columns the two dataframes share (intermediate_df). File c was read into a dataframe that had the zipcode as an integer type, so I had to convert that column to string so that zipcodes are not treated as integers. However, c_df reports that column as object after I convert it to string, which means I cannot do a join between c_df and intermediate_df to make final_df.
To illustrate what I mean:
import pandas as pd
a_data = pd.read_csv("a.csv")
b_data = pd.read_csv("b.csv", dtype={'zipcode': 'str'})
a_df = pd.DataFrame(a_data)
b_df = pd.DataFrame(b_data)
# file c conversion
c_data = pd.read_csv("slcsp.csv", dtype={'zipcode': 'str'})
print ("This is c data types: ", c_data.dtypes)
c_conversion = c_data['zipcode'].apply(str)
print ("This is c_conversion data types: ", c_conversion.dtypes)
c_df = pd.DataFrame(c_conversion)
print ("This is c_df data types: ", c_df.dtypes)
# Joining on the two common columns to avoid duplicates
joined_ab_df = pd.merge(a_df, b_df, on =['state', 'area'])
# Dropping columns that are not needed anymore
ab_for_analysis_df = joined_ab_df.drop(['county_code','name', 'area'], axis=1)
# Time to analyze this dataframe. Let's pick out only the silver values for a specific attribute
silver_only_df = ab_for_analysis_df[ab_for_analysis_df.metal_name == 'Silver']
# Getting second lowest value of silver only
sorted_silver = silver_only_df.groupby('zipcode')['rate'].nsmallest(2)
sorted_silver_df = sorted_silver.to_frame()
print ("We cleaned up our data. Let's join the dataframes.")
print ("Final result...")
print (c_df.dtypes)
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')
This is what we get after running it:
This is c_data types: zipcode object
rate float64
dtype: object
This is c_conversion_data types: object
This is c_df data types: zipcode object
dtype: object
zipcode object
dtype: object
We cleaned up our data. Let's join the dataframes.
This is the final result...
KeyError: 'zipcode'
Any idea why the data type changed, and how I can fix it so that everything joins in the end?
Upvotes: 1
Views: 433
Reputation: 862406
If you convert to str, the output dtype is always object.
To confirm that the values really are strings, check the type of each element:
print (c_data['zipcode'].apply(type))
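As a rough illustration of that check, here is a minimal sketch using made-up CSV text in place of the real file (the sample zipcodes and the `io.StringIO` stand-in are assumptions, not data from the question). Reading with `dtype={"zipcode": str}` keeps leading zeros, and each element is a Python `str` regardless of how the column dtype is reported:

```python
import io
import pandas as pd

# Hypothetical stand-in for file c; the real file is not shown in the question.
csv_text = "zipcode,rate\n01001,\n64148,\n"

# Reading the column as str keeps the leading zero in "01001".
c_data = pd.read_csv(io.StringIO(csv_text), dtype={"zipcode": str})

# The column dtype is what the question printed; it describes the storage,
# not the Python type of the individual values.
print(c_data["zipcode"].dtype)

# Checking the type of every element shows what is actually stored.
element_types = c_data["zipcode"].apply(type)
print(element_types.unique())
```

So a dtype of object is not by itself a sign that the string conversion failed.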
As for your last error:
You need reset_index, because otherwise zipcode is in the index, not a column:
sorted_silver_df = silver_only_df.groupby('zipcode')['rate'].nsmallest(2).reset_index()
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')
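To make the index issue concrete, here is a small sketch with invented rates (the sample values are assumptions; only the column names come from the question). It shows that the grouped result carries zipcode in its index, and that reset_index turns it back into a column before the merge:

```python
import pandas as pd

# Hypothetical sample data; only the column names match the question.
silver_only_df = pd.DataFrame({
    "zipcode": ["11111", "11111", "11111", "22222", "22222"],
    "rate": [300.0, 250.0, 275.0, 410.0, 400.0],
})
c_df = pd.DataFrame({"zipcode": ["11111", "22222"]})

# The two lowest rates per zipcode; the result is a Series whose index
# holds the zipcode together with the original row label.
two_lowest = silver_only_df.groupby("zipcode")["rate"].nsmallest(2)
print(two_lowest.index.names)

# reset_index moves the index levels into ordinary columns,
# so "zipcode" is available as a merge key.
sorted_silver_df = two_lowest.reset_index()
final_df = pd.merge(sorted_silver_df, c_df, on="zipcode")
print(final_df)
```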
Thanks to Andy for an alternative (untested):
sorted_silver_df = silver_only_df.groupby('zipcode', as_index=False)['rate'].nsmallest(2)
final_df = pd.merge(sorted_silver_df,c_df, on ='zipcode')
Or use left_index=True and right_on in merge:
sorted_silver = silver_only_df.groupby('zipcode')['rate'].nsmallest(2)
sorted_silver_df = sorted_silver.to_frame()
final_df = pd.merge(sorted_silver_df,c_df, right_on ='zipcode', left_index=True)
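One caveat worth sketching: nsmallest on a grouped Series typically yields a two-level index (zipcode plus the original row label), so an index-based merge may need the inner level dropped first. The version below adds that step as an assumption of mine, using invented sample data; it is not part of the snippet above:

```python
import pandas as pd

# Hypothetical sample data; only the column names match the question.
silver_only_df = pd.DataFrame({
    "zipcode": ["11111", "11111", "11111", "22222", "22222"],
    "rate": [300.0, 250.0, 275.0, 410.0, 400.0],
})
c_df = pd.DataFrame({"zipcode": ["11111", "22222"]})

two_lowest = silver_only_df.groupby("zipcode")["rate"].nsmallest(2)

# Assumed extra step: drop the inner level (original row labels) so the
# index contains only the zipcode.
by_zipcode_df = two_lowest.reset_index(level=1, drop=True).to_frame()

# Match the left frame's index against the right frame's zipcode column.
final_df = pd.merge(by_zipcode_df, c_df, left_index=True, right_on="zipcode")
print(final_df)
```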
Upvotes: 2