Reputation: 788
This may look like a duplicate of an existing thorough explanation of the various merge approaches, but I can't find a solution to my problem there because I have a larger number of DataFrames.
I have multiple DataFrames (more than 10), each differing in one column, VARX. This is just a quick and oversimplified example:
import pandas as pd
df1 = pd.DataFrame({'depth': [0.500000, 0.600000, 1.300000],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1', 'profile_1', 'profile_1']})
df2 = pd.DataFrame({'depth': [0.600000, 1.100000, 1.200000],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1', 'profile_1', 'profile_1']})
df3 = pd.DataFrame({'depth': [1.200000, 1.300000, 1.400000],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1', 'profile_1', 'profile_1']})
Each df has the same or different depths for the same profiles, so I need to create a new DataFrame that merges all the separate ones, using depth and profile as the key columns and keeping every depth value that appears for each profile.
The VARX value should therefore be NaN where there is no measurement of that variable at that depth for that profile.
The result should thus be a new, compressed DataFrame with all the VARX columns added alongside the depth and profile ones, something like this:
name_profile depth VAR1 VAR2 VAR3
profile_1 0.500000 38.196202 NaN NaN
profile_1 0.600000 38.198002 0.20440 NaN
profile_1 1.100000 NaN 0.20442 NaN
profile_1 1.200000 NaN 0.20446 15.1880
profile_1 1.300000 38.200001 NaN 15.1820
profile_1 1.400000 NaN NaN 15.1820
Note that the actual number of profiles is much, much bigger.
Any ideas?
Upvotes: 20
Views: 36804
Reputation: 88226
A simple way is to combine functools.partial and functools.reduce.
First, partial lets you "freeze" some of a function's arguments and/or keywords, producing a new callable with a simplified signature. Then reduce applies that partial object cumulatively to the items of an iterable (here, the list of DataFrames):
from functools import partial, reduce
dfs = [df1, df2, df3]
merge = partial(pd.merge, on=['depth', 'profile'], how='outer')
reduce(merge, dfs)
depth VAR1 profile VAR2 VAR3
0 0.5 38.196202 profile_1 NaN NaN
1 0.6 38.198002 profile_1 0.20440 NaN
2 1.3 38.200001 profile_1 NaN 15.182
3 1.1 NaN profile_1 0.20442 NaN
4 1.2 NaN profile_1 0.20446 15.188
5 1.4 NaN profile_1 NaN 15.182
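The row order of an outer merge depends on the order in which keys are encountered and can differ between pandas versions. A minimal sketch, reusing the question's sample frames, that sorts the merged result by the key columns and moves them to the front so it resembles the requested layout (the names `keys` and `merged` are only illustrative):

```python
from functools import partial, reduce

import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

keys = ['profile', 'depth']

# Freeze the join arguments, then fold the list of frames pairwise
merge = partial(pd.merge, on=keys, how='outer')
merged = reduce(merge, [df1, df2, df3])

# Sort by the keys and list the key columns first
merged = merged.sort_values(keys).reset_index(drop=True)
merged = merged[keys + [c for c in merged.columns if c not in keys]]
print(merged)
```

The same pattern should carry over to a longer list of frames, as long as each one has the two key columns and a distinct VARX column name.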
Upvotes: 15
Reputation: 828
You can also use:
dfs = [df1, df2, df3]
df = pd.merge(dfs[0], dfs[1], left_on=['depth','profile'], right_on=['depth','profile'], how='outer')
for d in dfs[2:]:
    df = pd.merge(df, d, left_on=['depth','profile'], right_on=['depth','profile'], how='outer')
depth VAR1 profile VAR2 VAR3
0 0.5 38.196202 profile_1 NaN NaN
1 0.6 38.198002 profile_1 0.20440 NaN
2 1.3 38.200001 profile_1 NaN 15.182
3 1.1 NaN profile_1 0.20442 NaN
4 1.2 NaN profile_1 0.20446 15.188
5 1.4 NaN profile_1 NaN 15.182
Upvotes: 1
Reputation: 96
Why not concatenate all the DataFrames, melt them, then pivot them back using your id columns? There might be a more efficient way to do this, but this works.
df=pd.melt(pd.concat([df1,df2,df3]),id_vars=['profile','depth'])
df_pivot=df.pivot_table(index=['profile','depth'],columns='variable',values='value')
where df_pivot will be:
variable VAR1 VAR2 VAR3
profile depth
profile_1 0.5 38.196202 NaN NaN
0.6 38.198002 0.20440 NaN
1.1 NaN 0.20442 NaN
1.2 NaN 0.20446 15.188
1.3 38.200001 NaN 15.182
1.4 NaN NaN 15.182
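Two things may be worth knowing about this approach: pivot_table aggregates with the mean by default, so repeated measurements of one variable at the same profile and depth would be averaged silently, and the result carries a MultiIndex rather than plain columns. A sketch, assuming the question's sample frames, that flattens the pivot back into ordinary columns (the names `long_df`, `wide` and `flat` are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Stack the frames, then go to long format: one row per (profile, depth, variable)
long_df = pd.melt(pd.concat([df1, df2, df3]), id_vars=['profile', 'depth'])

# Back to wide format; the default aggregation is the mean
wide = long_df.pivot_table(index=['profile', 'depth'],
                           columns='variable', values='value')

# Turn the index levels into columns and drop the 'variable' axis label
flat = wide.reset_index().rename_axis(columns=None)
print(flat)
```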
Upvotes: 1
Reputation: 11063
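I would use append (on pandas 2.0 or later, where DataFrame.append no longer exists, pd.concat([df1, df2, df3]) does the same job).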
>>> df1.append(df2).append(df3).sort_values('depth')
VAR1 VAR2 VAR3 depth profile
0 38.196202 NaN NaN 0.5 profile_1
1 38.198002 NaN NaN 0.6 profile_1
0 NaN 0.20440 NaN 0.6 profile_1
1 NaN 0.20442 NaN 1.1 profile_1
2 NaN 0.20446 NaN 1.2 profile_1
0 NaN NaN 15.188 1.2 profile_1
2 38.200001 NaN NaN 1.3 profile_1
1 NaN NaN 15.182 1.3 profile_1
2 NaN NaN 15.182 1.4 profile_1
Obviously if you have a lot of dataframes, just make a list and loop through them.
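Stacking leaves one row per original measurement, so a depth that occurs in several frames (0.6, 1.2 and 1.3 above) appears more than once. A sketch of one way to collapse those rows into a single row per profile and depth, using pd.concat for the stacking and GroupBy.first, which takes the first non-null value in each column. It assumes each variable has at most one measurement per profile and depth; the names `stacked` and `combined` are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

# Stack all frames vertically; missing columns are filled with NaN
stacked = pd.concat([df1, df2, df3], ignore_index=True)

# Collapse rows that share the same keys, keeping the first non-null
# value found in each VARX column
combined = stacked.groupby(['profile', 'depth'], as_index=False).first()
print(combined)
```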
Upvotes: 2
Reputation: 107567
Consider setting the index on each data frame and then running the horizontal merge with pd.concat:
dfs = [df.set_index(['profile', 'depth']) for df in [df1, df2, df3]]
print(pd.concat(dfs, axis=1).reset_index())
# profile depth VAR1 VAR2 VAR3
# 0 profile_1 0.5 38.196202 NaN NaN
# 1 profile_1 0.6 38.198002 0.20440 NaN
# 2 profile_1 1.1 NaN 0.20442 NaN
# 3 profile_1 1.2 NaN 0.20446 15.188
# 4 profile_1 1.3 38.200001 NaN 15.182
# 5 profile_1 1.4 NaN NaN 15.182
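Aligning on the index works cleanly when every frame has at most one row per (profile, depth) pair; with repeated keys the alignment is ambiguous and pandas may raise an error. A sketch, assuming the question's sample frames, that checks for repeated keys before concatenating and sorts the result explicitly (the check and the names `indexed` and `result` are illustrative additions, not part of the approach above):

```python
import pandas as pd

df1 = pd.DataFrame({'depth': [0.5, 0.6, 1.3],
                    'VAR1': [38.196202, 38.198002, 38.200001],
                    'profile': ['profile_1'] * 3})
df2 = pd.DataFrame({'depth': [0.6, 1.1, 1.2],
                    'VAR2': [0.20440, 0.20442, 0.20446],
                    'profile': ['profile_1'] * 3})
df3 = pd.DataFrame({'depth': [1.2, 1.3, 1.4],
                    'VAR3': [15.1880, 15.1820, 15.1820],
                    'profile': ['profile_1'] * 3})

keys = ['profile', 'depth']
indexed = [df.set_index(keys) for df in (df1, df2, df3)]

# Assumption: each frame should hold at most one row per (profile, depth)
for i, frame in enumerate(indexed, start=1):
    dupes = frame.index[frame.index.duplicated()]
    if len(dupes) > 0:
        raise ValueError(f"frame {i} has repeated keys: {list(dupes)}")

# Align all frames on the shared index, then restore the keys as columns
result = pd.concat(indexed, axis=1).sort_index().reset_index()
print(result)
```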
Upvotes: 22