Python - merge dataframes to replace missing values with the actual ones

Question

I have a list of 20 dataframes with sports results with this structure:

Key1	Key2	Variable1	Variable2
TeamA	TeamB	20	Nan
TeamC	TeamA	Nan	25
TeamA	TeamD	17	Nan

Key1	Key2	Variable1	Variable2
TeamA	TeamB	Nan	45
TeamB	TeamC	90	Nan
TeamB	TeamD	57	Nan

Key1	Key2	Variable1	Variable2
TeamC	TeamA	18	Nan
TeamB	TeamC	Nan	17
TeamC	TeamD	84	Nan

I guess you get the idea: each dataframe has all the games for a particular team and several variables related to that team, while the variables for the other team are empty. I would like to merge all the dataframes into a single one, so that the Nan values are replaced by the correct ones. I have been trying to use pandas merge, but I could not get it right. Any suggestions?

kenny_123 · Accepted Answer

You can try it like this. The first part only recreates your data: I saved the tables as .csv files and read them back in. Note that I pass na_values='Nan' so that pandas recognizes your 'Nan' entries as missing values:

import pandas as pd
import os

path = r'C:\...'
df1_fl = r'2020-12-31_df1.csv'
df2_fl = r'2020-12-31_df2.csv'
df3_fl = r'2020-12-31_df3.csv'

df1 = pd.read_csv(os.path.join(path, df1_fl), sep=';', na_values='Nan')
df2 = pd.read_csv(os.path.join(path, df2_fl), sep=';', na_values='Nan')
df3 = pd.read_csv(os.path.join(path, df3_fl), sep=';', na_values='Nan')
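If you would rather reproduce this without any CSV files, here is a minimal sketch that builds the same three sample tables directly in memory. The values are copied from the question; using np.nan for the 'Nan' entries and collecting the frames in a list called frames are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; np.nan stands in for 'Nan'.
df1 = pd.DataFrame({
    "Key1": ["TeamA", "TeamC", "TeamA"],
    "Key2": ["TeamB", "TeamA", "TeamD"],
    "Variable1": [20, np.nan, 17],
    "Variable2": [np.nan, 25, np.nan],
})
df2 = pd.DataFrame({
    "Key1": ["TeamA", "TeamB", "TeamB"],
    "Key2": ["TeamB", "TeamC", "TeamD"],
    "Variable1": [np.nan, 90, 57],
    "Variable2": [45, np.nan, np.nan],
})
df3 = pd.DataFrame({
    "Key1": ["TeamC", "TeamB", "TeamC"],
    "Key2": ["TeamA", "TeamC", "TeamD"],
    "Variable1": [18, np.nan, 84],
    "Variable2": [np.nan, 17, np.nan],
})

# Hypothetical name: a list makes it easy to scale to all 20 dataframes.
frames = [df1, df2, df3]
```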

Then I concatenate all your dataframes into a single one and replace the NaN values with zero:

df = pd.concat([df1, df2, df3]).fillna(0)

Then the interesting part starts: group the data by the columns 'Key1' and 'Key2' and take the max over each group. Because every missing value is now 0, the max picks the real value wherever one exists (this assumes the real values are never negative). Finally, reset_index turns the resulting multi-index back into the two columns 'Key1' and 'Key2', as in the original dataframes.

df_agg = df.groupby(by=['Key1', 'Key2']).max().reset_index()
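As a possible variation, not part of the approach above: if a 0 could be confused with a real result, or if values might be negative, one option is to skip fillna and use groupby(...).first(), which returns the first non-missing value per column in each group and leaves NaN where no dataframe has a value. The sketch below rebuilds the question's sample data so it runs on its own; the helper name make_frame and the variable merged are only illustrative.

```python
import numpy as np
import pandas as pd

def make_frame(rows):
    # Hypothetical helper: build one sample dataframe from row tuples.
    return pd.DataFrame(rows, columns=["Key1", "Key2", "Variable1", "Variable2"])

frames = [
    make_frame([("TeamA", "TeamB", 20, np.nan),
                ("TeamC", "TeamA", np.nan, 25),
                ("TeamA", "TeamD", 17, np.nan)]),
    make_frame([("TeamA", "TeamB", np.nan, 45),
                ("TeamB", "TeamC", 90, np.nan),
                ("TeamB", "TeamD", 57, np.nan)]),
    make_frame([("TeamC", "TeamA", 18, np.nan),
                ("TeamB", "TeamC", np.nan, 17),
                ("TeamC", "TeamD", 84, np.nan)]),
]

# Stack all frames, then keep the first non-missing value per game and column.
merged = (
    pd.concat(frames, ignore_index=True)
      .groupby(["Key1", "Key2"])
      .first()
      .reset_index()
)
print(merged)
```

With the sample data this gives one row per game (six in total), for example TeamA vs TeamB with Variable1 = 20 and Variable2 = 45. Games that only appear in one dataframe, such as TeamA vs TeamD, keep NaN in Variable2 instead of 0.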
