Reputation: 1356
i have 2 unequal length pandas that I am trying to combine based on their matching hour, month, year, and plant_name. I'm not sure whether merge/concat/join is most efficient or uses less memory since my actual dataset is very large and I'm getting unexpected results with attempts like shown at the bottom here. As an example, my data looks like this for dataframe 1 (df1):
plant_name business_name maint_region_name wind_speed_ms dataset year month day hour
0 CAETITE I BRAZIL BRAZIL 7.88 ERA5 2021 5 31 21
1 CAETITE I BRAZIL BRAZIL 7.95 ERA5 2021 5 31 20
2 RIO DO FOGO BRAZIL BRAZIL 7.72 ERA5 2021 1 31 19
3 RIO DO FOGO BRAZIL BRAZIL 7.68 ERA5 2021 1 31 18
4 CAETITE I BRAZIL BRAZIL 7.41 ERA5 2021 1 31 17
5 CAETITE I BRAZIL BRAZIL 7.32 ERA5 2021 1 31 16
And, dataframe 2 or df2:
plant_name vel year month day hour
0 CAETITE I NaN 2021 1 31 19
1 CAETITE I NaN 2021 1 31 18
2 CAETITE I 5.42 2021 1 31 17
3 CAETITE I NaN 2021 1 31 16
And, the final df or df_combined should look like this:
plant_name business_name maint_region_name wind_speed_ms dataset year month day hour vel
0 CAETITE I BRAZIL BRAZIL 7.41 ERA5 2021 1 31 17 5.42
1 CAETITE I BRAZIL BRAZIL 7.32 ERA5 2021 1 31 16 NaN
Ive tried many things like this below but each attempt is causing "df3" to have considerably large number of rows and more columns in my actual data than I expect so I must be doing something wrong with the approach below:
df3 = df2.merge(df1[["plant_name","year","month","day","hour","wind_speed_ms"]], on=["plant_name"], how="left")
thank you for your time.
Upvotes: 0
Views: 21
Reputation: 606
What you are looking for is an inner-join here. Also you need to add year, month, day and hour to the join-key. The following should give you the right answer:
df1.merge(df2, on=["plant_name", "year", "month", "day", "hour"], how="inner")
Upvotes: 1