Ajay H
Ajay H

Reputation: 824

pandas join gives NaN values

I want to join 2 DataFrames

Zipcode Database (first 10 entries)

    0   zip_code           City       State       County  Population
0   0      90001    Los Angeles  California  Los Angeles       54481
1   1      90002    Los Angeles  California  Los Angeles       44584
2   2      90003    Los Angeles  California  Los Angeles       58187
3   3      90004    Los Angeles  California  Los Angeles       67850
4   4      90005    Los Angeles  California  Los Angeles       43014
5   5      90006    Los Angeles  California  Los Angeles       62765
6   6      90007    Los Angeles  California  Los Angeles       45021
7   7      90008    Los Angeles  California  Los Angeles       30840
8   8      90009    Los Angeles  California  Los Angeles           -
9   9      90010    Los Angeles  California  Los Angeles        1943

And data (first 10 entries)

    buyer                               zip_code
0   SWEENEY,THOMAS R & MICHELLE H       NaN
1   DOUGHERTY,HERBERT III & JENNIFER M  NaN
2   WEST COAST RLTY SVCS INC            NaN
3   LOVE,JULIE M                        NaN
4   SAHAR,DAVID NaN
5   SILBERSTERN,BRADLEY E TRUST         91199
6   LEE,SUSAN & JIMMY C                 92025
7   FRAZZANO REAL ESTATE I NC           NaN
8   RUV INVESTMENTS LLC                 91730
9   KAOS KAPITAL LLC                    NaN

So the final table should have [buyer, zip_code, City, County]. I'm joining with respect to Zip code.

data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')

But the city and county columns are NaN even for the tuples in data where zipcode is actually present.

    buyer   zip_code    City    County
10  LANDON AVE TRUST 37736  NaN NaN NaN
11  UMAR,AHMAD  NaN NaN NaN
12  3 JPS INC   90717   NaN NaN
13  T & L HOLDINGS INC  95610   NaN NaN
14  CAHP HOLDINGS LLC   90808   NaN NaN
15  REBUILDING TOGETHER LONG BEACH  92344   NaN NaN
16  COLFIN AI-CA 4 LLC  NaN NaN NaN
17  GUTIERREZ,HUGO  91381   NaN NaN
18  VALBRIDGE CAP GOLDEN GATE FUND  NaN NaN NaN
19  SOLARES,OSCAR   92570   NaN NaN

Why is this the case? The zipcode database has all zipcodes from 90001 - 999950.

My first thought is the datatype of "zip_code" in both are different:

print(zipcode_database['zip_code'].dtype)
print(data['zip_code'].dtype)

Output:

int64
object

Thought of typecasting with astype, but this does not work with NaN values. Any thoughts?

Upvotes: 0

Views: 1277

Answers (1)

Grr
Grr

Reputation: 16099

You can cast NaN values to float types, but not int. In your case I would cast the zip_code field in both DataFrames to a float and then join.

zipcode_database.zip_code = zipcode_database.zip_code.astype(float)
data.zip_code = data.zip_code.astype(float)
data_2 = data.join(zipcode_database[['City', 'County', 'zip_code']].set_index('zip_code'), on='zip_code')

I can't reproduce anything meaningful from your example data (no matching zip codes), but that should fix the issue.

Upvotes: 2

Related Questions