Reputation: 7023
I have multiple sub-DataFrames which I read from CSV files, and I want to combine them to one big DataFrame using pandas
.
My issue is that some of the columns in the separate sub-DataFrames show an overlap. And if they do, the values need to be inserted at the correct place in the final DataFrame.
Generally, all sub-DataFrames have an ID
column - the set
of all ID
values of all those DataFrames, should combine to the final big DataFrame's ID
column.
Each ID
has a specific CODE
assigned to it, which is consistent among all sub-DataFrames, so it could potentially be always overwritten as the values should remain the same.
I've tried every-which way, merge
, join
, concat
and even plain old loop and index, with index column, without, you name it - but, to no avail.
I want to add, that some methods create new columns with suffixes - but my intention is to combine all values from overlapping columns into a single column, so that is not an option/
Here's some sample data:
import pandas as pd
import numpy as np
np.random.seed(42)
df_1 = pd.DataFrame({
'ID':[3,4,5,6],
'CODE':[2,2,5,4],
'M1':np.random.rand(4),
'M2':np.random.rand(4)
})
df_2 = pd.DataFrame({
'ID':[8,9,10],
'CODE':[7,2,4],
'M1':np.random.rand(3),
'M2':np.random.rand(3)
})
df_3 = pd.DataFrame({
'ID':[3,4,5,6],
'CODE':[2,2,5,4],
'M3':np.random.rand(4),
'M4':np.random.rand(4)
})
df_4 = pd.DataFrame({
'ID':[8,9,10],
'CODE':[7,2,4],
'M3':np.random.rand(3),
'M4':np.random.rand(3)
})
df_5 = pd.DataFrame({
'ID':[8,9,10],
'CODE':[7,2,4],
'M5':np.random.rand(3),
'M6':np.random.rand(3)
})
Using merge
with how="outer"
I was able to merge df_1
, df_2
and df_3
with the result being as I need it.
ID CODE M1 M2 M3 M4
0 3 2 0.374540 0.156019 0.181825 0.431945
1 4 2 0.950714 0.155995 0.183405 0.291229
2 5 5 0.731994 0.058084 0.304242 0.611853
3 6 4 0.598658 0.866176 0.524756 0.139494
4 8 7 0.601115 0.969910 NaN NaN
5 9 2 0.708073 0.832443 NaN NaN
6 10 4 0.020584 0.212339 NaN NaN
But adding df_4
, the data gets appended below rather then inserted in the correct places (so there would be no NaN
s in this case):
ID CODE M1 M2 M3 M4
0 3 2 0.374540 0.156019 0.181825 0.431945
1 4 2 0.950714 0.155995 0.183405 0.291229
2 5 5 0.731994 0.058084 0.304242 0.611853
3 6 4 0.598658 0.866176 0.524756 0.139494
4 8 7 0.601115 0.969910 NaN NaN
5 9 2 0.708073 0.832443 NaN NaN
6 10 4 0.020584 0.212339 NaN NaN
7 8 7 NaN NaN 0.292145 0.785176
8 9 2 NaN NaN 0.366362 0.199674
9 10 4 NaN NaN 0.456070 0.514234
Finally, combining all DataFrames in this example should yield this result:
ID CODE M1 M2 M3 M4 M5 M6
0 3 2 0.374540 0.156019 0.181825 0.431945 NaN NaN
1 4 2 0.950714 0.155995 0.183405 0.291229 NaN NaN
2 5 5 0.731994 0.058084 0.304242 0.611853 NaN NaN
3 6 4 0.598658 0.866176 0.524756 0.139494 NaN NaN
4 8 7 0.601115 0.969910 0.292145 0.785176 0.592414 0.170524
5 9 2 0.708073 0.832443 0.366362 0.199674 0.046450 0.065051
6 10 4 0.020584 0.212339 0.456070 0.514234 0.607544 0.948885
Upvotes: 1
Views: 308
Reputation: 38415
Merge dataframes with identical ID and codes and concatenate them.
pd.concat([df_1.merge(df_3, how = 'outer'),df_2.merge(df_4, how = 'outer').merge(df_5, how = 'outer')], sort = True)
ID CODE M1 M2 M3 M4 M5 M6
0 3 2 0.374540 0.156019 0.181825 0.431945 NaN NaN
1 4 2 0.950714 0.155995 0.183405 0.291229 NaN NaN
2 5 5 0.731994 0.058084 0.304242 0.611853 NaN NaN
3 6 4 0.598658 0.866176 0.524756 0.139494 NaN NaN
4 8 7 0.601115 0.969910 0.292145 0.785176 0.592415 0.170524
5 9 2 0.708073 0.832443 0.366362 0.199674 0.046450 0.065052
6 10 4 0.020584 0.212339 0.456070 0.514234 0.607545 0.948886
Another solution using groupby. Concat all dataframes on axis 0, groupby on ID, CODE and first() returns first non-NaN value.
dfs = [df_1, df_2, df_3, df_4, df_5]
pd.concat(dfs, sort = False).groupby(['CODE', 'ID']).first().sort_index(level = 1).reset_index()
Upvotes: 1