Reputation: 13
I have a list of 20 dataframes with sports results with this structure:
Key1 | Key2 | Variable1 | Variable2 |
---|---|---|---|
TeamA | TeamB | 20 | Nan |
TeamC | TeamA | Nan | 25 |
TeamA | TeamD | 17 | Nan |

Key1 | Key2 | Variable1 | Variable2 |
---|---|---|---|
TeamA | TeamB | Nan | 45 |
TeamB | TeamC | 90 | Nan |
TeamB | TeamD | 57 | Nan |

Key1 | Key2 | Variable1 | Variable2 |
---|---|---|---|
TeamC | TeamA | 18 | Nan |
TeamB | TeamC | Nan | 17 |
TeamC | TeamD | 84 | Nan |
I guess you get the idea: each dataframe has all the games for a particular team and several variables related to that team, while the variables for the other team are empty. I would like to merge all the dataframes into a single one, so that each Nan is replaced by the correct value from the other team's dataframe. I have been trying to use pandas merge, but I could not get it right. Any suggestions?
Upvotes: 0
Views: 186
Reputation: 81
You can try it like this. The first part only recreates your data files (I saved them as .csv and then read them back in). Keep in mind that I pass na_values='Nan' when reading, so that pandas recognizes your 'Nan' entries as missing values:
import pandas as pd
import os
path = r'C:\...'
df1_fl = r'2020-12-31_df1.csv'
df2_fl = r'2020-12-31_df2.csv'
df3_fl = r'2020-12-31_df3.csv'
df1 = pd.read_csv(os.path.join(path, df1_fl), sep=';', na_values='Nan')
df2 = pd.read_csv(os.path.join(path, df2_fl), sep=';', na_values='Nan')
df3 = pd.read_csv(os.path.join(path, df3_fl), sep=';', na_values='Nan')
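If you do not have the files at hand, the same frames can be rebuilt from in-memory text. This is only a minimal sketch, assuming semicolon-separated content that mirrors the tables in the question; the name csv_texts and the inline strings are made up for illustration and are not the original files:

```python
import io

import pandas as pd

# Hypothetical inline copies of the three tables from the question,
# written as semicolon-separated text (an assumption about the file layout).
csv_texts = [
    "Key1;Key2;Variable1;Variable2\n"
    "TeamA;TeamB;20;Nan\nTeamC;TeamA;Nan;25\nTeamA;TeamD;17;Nan\n",
    "Key1;Key2;Variable1;Variable2\n"
    "TeamA;TeamB;Nan;45\nTeamB;TeamC;90;Nan\nTeamB;TeamD;57;Nan\n",
    "Key1;Key2;Variable1;Variable2\n"
    "TeamC;TeamA;18;Nan\nTeamB;TeamC;Nan;17\nTeamC;TeamD;84;Nan\n",
]

# na_values='Nan' tells pandas to treat the literal string 'Nan' as missing.
frames = [
    pd.read_csv(io.StringIO(text), sep=';', na_values='Nan')
    for text in csv_texts
]
df1, df2, df3 = frames
```

With 20 dataframes, keeping them in a list like frames is probably more convenient than 20 separate variables.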
Then I just replace the NaN values with zero and concatenate all your data into one dataframe:
df = pd.concat([df1, df2, df3]).fillna(0)
Then the interesting part starts: group the data by the columns 'Key1' and 'Key2' and take the max over each group, which fills in the missing values. Finally, reset_index turns the resulting multi-index back into the two columns 'Key1' and 'Key2', as in the original dataframes.
df_agg = df.groupby(by=['Key1', 'Key2']).max().reset_index()
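As a side note, the groupby aggregations skip NaN by default, so the fillna(0) step mainly matters if you want values that are missing in every dataframe to come out as 0 instead of NaN (and max over zeros would give wrong results for negative values). Here is a self-contained sketch of a variant without it, using first() to pick the first non-null value per pair of keys. The three small frames are built inline from the tables in the question and stand in for your list of 20; the names frames, combined and merged are just for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative stand-ins for the list of dataframes in the question.
df1 = pd.DataFrame({
    'Key1': ['TeamA', 'TeamC', 'TeamA'],
    'Key2': ['TeamB', 'TeamA', 'TeamD'],
    'Variable1': [20, np.nan, 17],
    'Variable2': [np.nan, 25, np.nan],
})
df2 = pd.DataFrame({
    'Key1': ['TeamA', 'TeamB', 'TeamB'],
    'Key2': ['TeamB', 'TeamC', 'TeamD'],
    'Variable1': [np.nan, 90, 57],
    'Variable2': [45, np.nan, np.nan],
})
df3 = pd.DataFrame({
    'Key1': ['TeamC', 'TeamB', 'TeamC'],
    'Key2': ['TeamA', 'TeamC', 'TeamD'],
    'Variable1': [18, np.nan, 84],
    'Variable2': [np.nan, 17, np.nan],
})
frames = [df1, df2, df3]

# Stack all rows, then collapse rows that describe the same game.
combined = pd.concat(frames, ignore_index=True)

# first() returns the first non-null value per column within each group,
# so each NaN is filled from the other team's row when one exists.
merged = combined.groupby(['Key1', 'Key2']).first().reset_index()
print(merged)
```

Games that appear in only one dataframe (for example TeamA vs TeamD here) keep NaN in the column nobody supplied, which makes it easy to see what is still missing.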
Upvotes: 2