Reputation: 1356
I am fairly certain I know why I'm getting nan values in DF2 after I merge 2 DFs but I cannot understand how to fix the problem. My data looks like this below. The DF2 df is a very large DF with year, month and day going back to the year 1991. I'm trying to match or merge a subset of this DF2 with a much smaller year/month/day subset in DF1 from the year == 2018. Usually this problem with 'nan' values is from some inhomogeneity with the column names in df1, df2 not matching or similar but my columns look the same to me.
DF1:
Out[2]:
plant_name obsvals year month day hour
0 COPPER CROSSING 0.0 2018 1 1 2
1 COPPER CROSSING 0.0 2018 1 1 3
2 COPPER CROSSING 0.0 2018 1 1 4
3 COPPER CROSSING 0.0 2018 1 1 5
4 COPPER CROSSING 0.0 2018 1 1 6
My second df looks like this: DF2:
plant_name business_name maint_region_name modelvals dataset year month day hour
245448 COPPER CROSSING UNITED STATES CENTRAL 3839.9 ERA5 2018 1 1 0
245449 COPPER CROSSING UNITED STATES CENTRAL 0.0 ERA5 2018 1 1 1
245450 COPPER CROSSING UNITED STATES CENTRAL 0.0 ERA5 2018 1 1 2
245451 COPPER CROSSING UNITED STATES CENTRAL 0.0 ERA5 2018 1 1 3
245452 COPPER CROSSING UNITED STATES CENTRAL 0.0 ERA5 2018 1 1 4
My idea to merge these two DFs is this but it gives 'nan' values for DF2 in the 'modelvals' column using how ="inner" and how = "left". Thanks for your ideas how to merge these 2 DFs.
DF3 = DF1.merge(DF2, on=["plant_name", "year", "month", "day", "hour"], how="inner")
output looks like this:
fdf = DF3[(DF3['year'] == 2018) & (DF3['month'] == 1) & (DF3['day']==1) ]
plant_name obsvals year month day hour business_name maint_region_name modelvals dataset
0 COPPER CROSSING 0.0 2018 1 1 2 UNITED STATES CENTRAL NaN ERA5
1 COPPER CROSSING 0.0 2018 1 1 3 UNITED STATES CENTRAL NaN ERA5
2 COPPER CROSSING 0.0 2018 1 1 4 UNITED STATES CENTRAL NaN ERA5
3 COPPER CROSSING 0.0 2018 1 1 5 UNITED STATES CENTRAL NaN ERA5
4 COPPER CROSSING 0.0 2018 1 1 6 UNITED STATES CENTRAL NaN ERA5
You can see the 'nan' values in the 'modelvals' column. I have included below the datatypes for both pandas:
DF1:
plant_name object
obsvals float64
year int64
month int64
day int64
hour int64
dtype: object
DF2:
plant_name object
business_name object
maint_region_name object
modelvals float64
dataset object
year int64
month int64
day int64
hour int64
dtype: object
I'm merging on 'plant_name','year','month','day','hour' which look to have the same datatypes. In a previous step in the code I am removing 'nan' values from DF1 df before the merge step shown above.
Upvotes: 0
Views: 2796
Reputation: 12
First you shoud drop nan values using df.dropna() . when you drop nan values then there indexes may not in proper order.
so run df.reindex(axis = 0)
Upvotes: 1