user2100039
user2100039

Reputation: 1356

Pandas Merge Gives nan values

I am fairly certain I know why I'm getting nan values in DF2 after I merge 2 DFs but I cannot understand how to fix the problem. My data looks like this below. The DF2 df is a very large DF with year, month and day going back to the year 1991. I'm trying to match or merge a subset of this DF2 with a much smaller year/month/day subset in DF1 from the year == 2018. Usually this problem with 'nan' values is from some inhomogeneity with the column names in df1, df2 not matching or similar but my columns look the same to me.

DF1:

Out[2]: 
        plant_name  obsvals  year  month  day  hour
0  COPPER CROSSING      0.0  2018      1    1     2
1  COPPER CROSSING      0.0  2018      1    1     3
2  COPPER CROSSING      0.0  2018      1    1     4
3  COPPER CROSSING      0.0  2018      1    1     5
4  COPPER CROSSING      0.0  2018      1    1     6

My second df looks like this: DF2:

             plant_name  business_name maint_region_name  modelvals     dataset  year  month  day  hour
245448  COPPER CROSSING  UNITED STATES           CENTRAL     3839.9  ERA5        2018      1    1     0
245449  COPPER CROSSING  UNITED STATES           CENTRAL        0.0  ERA5        2018      1    1     1
245450  COPPER CROSSING  UNITED STATES           CENTRAL        0.0  ERA5        2018      1    1     2
245451  COPPER CROSSING  UNITED STATES           CENTRAL        0.0  ERA5        2018      1    1     3
245452  COPPER CROSSING  UNITED STATES           CENTRAL        0.0  ERA5        2018      1    1     4

My idea to merge these two DFs is this but it gives 'nan' values for DF2 in the 'modelvals' column using how ="inner" and how = "left". Thanks for your ideas how to merge these 2 DFs.

DF3 = DF1.merge(DF2, on=["plant_name", "year", "month", "day", "hour"], how="inner")

output looks like this:

fdf = DF3[(DF3['year'] == 2018) & (DF3['month'] == 1) & (DF3['day']==1) ]

         plant_name  obsvals  year  month  day  hour  business_name maint_region_name  modelvals     dataset
0   COPPER CROSSING      0.0  2018      1    1     2  UNITED STATES           CENTRAL        NaN  ERA5      
1   COPPER CROSSING      0.0  2018      1    1     3  UNITED STATES           CENTRAL        NaN  ERA5      
2   COPPER CROSSING      0.0  2018      1    1     4  UNITED STATES           CENTRAL        NaN  ERA5      
3   COPPER CROSSING      0.0  2018      1    1     5  UNITED STATES           CENTRAL        NaN  ERA5      
4   COPPER CROSSING      0.0  2018      1    1     6  UNITED STATES           CENTRAL        NaN  ERA5      

You can see the 'nan' values in the 'modelvals' column. I have included below the datatypes for both pandas:

DF1:
plant_name     object
obsvals       float64
year            int64
month           int64
day             int64
hour            int64
dtype: object

DF2:
plant_name            object
business_name         object
maint_region_name     object
modelvals            float64
dataset               object
year                   int64
month                  int64
day                    int64
hour                   int64
dtype: object 

I'm merging on 'plant_name','year','month','day','hour' which look to have the same datatypes. In a previous step in the code I am removing 'nan' values from DF1 df before the merge step shown above.

Upvotes: 0

Views: 2796

Answers (1)

Furqan
Furqan

Reputation: 12

First you shoud drop nan values using df.dropna() . when you drop nan values then there indexes may not in proper order.

so run df.reindex(axis = 0)

Upvotes: 1

Related Questions