Manu

Reputation: 13

Python - merge dataframes to replace missing values with the actual ones

I have a list of 20 dataframes with sports results with this structure:

Key1 Key2 Variable1 Variable2
TeamA TeamB 20 Nan
TeamC TeamA Nan 25
TeamA TeamD 17 Nan

Key1 Key2 Variable1 Variable2
TeamA TeamB Nan 45
TeamB TeamC 90 Nan
TeamB TeamD 57 Nan

Key1 Key2 Variable1 Variable2
TeamC TeamA 18 Nan
TeamB TeamC Nan 17
TeamC TeamD 84 Nan

I guess you get the idea: each dataframe holds all the games for one particular team, along with several variables related to that team, while the variables for the other team are empty. I would like to merge all the dataframes into a single one, so that the Nan values are replaced by the correct values. I have been trying to use pandas merge, but I could not get it right. Any suggestions?

Upvotes: 0

Views: 186

Answers (1)

kenny_123

Reputation: 81

You can try it like this. The first part only recreates your dataframes (I saved them as .csv files and then read them in). Note that I pass na_values='Nan' when reading, so that pandas recognizes your 'Nan' entries as missing values:

import pandas as pd
import os

path = r'C:\...'
df1_fl = r'2020-12-31_df1.csv'
df2_fl = r'2020-12-31_df2.csv'
df3_fl = r'2020-12-31_df3.csv'

df1 = pd.read_csv(os.path.join(path, df1_fl), sep=';', na_values='Nan')
df2 = pd.read_csv(os.path.join(path, df2_fl), sep=';', na_values='Nan')
df3 = pd.read_csv(os.path.join(path, df3_fl), sep=';', na_values='Nan')
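
If you want to check what na_values='Nan' does without creating files, here is a minimal sketch. It assumes the first table from the question, typed inline with ';' as separator; csv_text and missing_counts are names made up for this illustration.

```python
import io
import pandas as pd

# Hypothetical inline copy of the first table from the question,
# using ';' as separator to match the read_csv calls above.
csv_text = """Key1;Key2;Variable1;Variable2
TeamA;TeamB;20;Nan
TeamC;TeamA;Nan;25
TeamA;TeamD;17;Nan
"""

# 'Nan' (capital N, lowercase an) is not one of the strings pandas treats
# as missing by default, so it is passed explicitly via na_values.
df1 = pd.read_csv(io.StringIO(csv_text), sep=';', na_values='Nan')

# Count the missing entries per column to confirm they were recognized.
missing_counts = df1.isna().sum()
print(df1.dtypes)
print(missing_counts)
```

With na_values set, the two variable columns should come out as numeric columns with real missing values instead of text columns containing the string 'Nan'.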

Then I concatenate all your data into one dataframe and replace the NaN values with zero:

df = pd.concat([df1, df2, df3]).fillna(0)

Then the interesting part starts: group the data by the columns 'Key1' and 'Key2' and take the max over each group. Since the missing values are now zeros, the max picks the actual value wherever one exists (this assumes the real values are not negative). Finally, reset_index turns the resulting multi-index back into the two columns 'Key1' and 'Key2', as in the original dataframes.

df_agg = df.groupby(by=['Key1', 'Key2']).max().reset_index()
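
As a possible variation (not part of the steps above), you could skip fillna(0) and use first() instead of max(): groupby's first() returns the first non-missing value per column in each group, so it also works with negative numbers and leaves values that are missing in every dataframe as NaN rather than 0. A minimal sketch, assuming the three example tables from the question are built inline; cols, frames and merged are names made up for this illustration.

```python
import numpy as np
import pandas as pd

cols = ['Key1', 'Key2', 'Variable1', 'Variable2']

# The three example tables from the question, built inline.
df1 = pd.DataFrame([['TeamA', 'TeamB', 20, np.nan],
                    ['TeamC', 'TeamA', np.nan, 25],
                    ['TeamA', 'TeamD', 17, np.nan]], columns=cols)
df2 = pd.DataFrame([['TeamA', 'TeamB', np.nan, 45],
                    ['TeamB', 'TeamC', 90, np.nan],
                    ['TeamB', 'TeamD', 57, np.nan]], columns=cols)
df3 = pd.DataFrame([['TeamC', 'TeamA', 18, np.nan],
                    ['TeamB', 'TeamC', np.nan, 17],
                    ['TeamC', 'TeamD', 84, np.nan]], columns=cols)

frames = [df1, df2, df3]  # with 20 dataframes, put all of them in this list

# Stack all rows, then collapse rows that share the same (Key1, Key2),
# keeping the first non-missing value found in each variable column.
merged = (pd.concat(frames, ignore_index=True)
            .groupby(['Key1', 'Key2'], as_index=False)
            .first())
print(merged)
```

Here the TeamA/TeamB game ends up with both Variable1 and Variable2 filled in, while a game like TeamA/TeamD, which only appears in one table, keeps NaN for the variable that was never reported.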

Upvotes: 2
