Reputation: 5764
I got nine different dataframes that I want to join (or merge, or update) into one single dataframe. Each of those original dataframes consists of two columns only, an in seconds and value for that observation. The data looks like this:
Filter_type Time
0 0.0 6333.137168
Filter_type Time
0 0.0 6347.422576
Filter_type Time
0 0.0 7002.406185
Filter_type Time
0 0.0 7015.845717
Sign_pos_X Time
0 11.5 6333.137168
1 25.0 6347.422576
2 25.5 7002.406185
3 38.0 7015.845717
Sign_pos_Y Time
0 -3.0 6333.137168
1 8.0 6347.422576
2 -7.5 7002.406185
3 -0.5 7015.845717
Sign_pos_Z Time
0 1.0 6333.137168
1 1.0 6347.422576
2 1.0 7002.406185
3 7.5 7015.845717
Supplementary_sign_type Time
0 0.0 6333.137168
1 0.0 6347.422576
2 0.0 7002.406185
3 0.0 7015.845717
Time vision_only_sign_type
0 6333.137168 7.0
1 6347.422576 9.0
2 7002.406185 9.0
3 7015.845717 35.0
Since I want to join all of them into one single dataframe, I tried the following:
df2 = None
for cell in df['Frames']:
if not isinstance(cell, list):
continue
df_ = pd.DataFrame(cell)
if df2 is None:
# first iteration
df2 = df_
continue
df2 = df2.merge(df_, on='Offset', how='outer')
#df2 = df2.join(df_)
#df2.update(df_, join='outer')
df2
The problem is, that the first four dataframes have the same name of the value column while the others don't. Therefore the result has three columns with the prefix 'Filter_type':
+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+
| | Filter_type_x | Offset | Filter_type_y | Filter_type_x | Filter_type_y | Sign_pos_X | Sign_pos_Y | Sign_pos_Z | Supplementary_sign_type | vision_only_sign_type |
|----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------|
| 0 | 0 | 6333.14 | nan | nan | nan | 11.5 | -3 | 1 | 0 | 7 |
| 1 | nan | 6347.42 | 0 | nan | nan | 25 | 8 | 1 | 0 | 9 |
| 2 | nan | 7002.41 | nan | 0 | nan | 25.5 | -7.5 | 1 | 0 | 9 |
| 3 | nan | 7015.85 | nan | nan | 0 | 38 | -0.5 | 7.5 | 0 | 35 |
+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+
My question is: How can I force the merge/join to join all the 'Filter_type' columns into one. You can see that each row has only one value in all of such columns while the others are NaN. Result should look like this (having only one merged column 'Filter_type'):
+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+
| | Offset | Sign_pos_X | Sign_pos_Y | Sign_pos_Z | Supplementary_sign_type | vision_only_sign_type | Filter_type |
|----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------|
| 0 | 6333.14 | 11.5 | -3 | 1 | 0 | 7 | 0 |
| 1 | 6347.42 | 25 | 8 | 1 | 0 | 9 | 0 |
| 2 | 7002.41 | 25.5 | -7.5 | 1 | 0 | 9 | 0 |
| 3 | 7015.85 | 38 | -0.5 | 7.5 | 0 | 35 | 0 |
+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+
Upvotes: 1
Views: 2036
Reputation: 880907
Calling pd.merge
in a loop leads to quadratic copying and slow performance when the length or sheer number of DataFrames is large. So avoid this if possible.
Here, it appears that we want to concatenate the DataFrames vertically when they have Time
and Filter_type
columns, and we wish to concatenate horizontally when the DataFrames lack a Filter_type
column:
frames = [df.set_index('Time') for df in frames]
filter_type_frames = pd.concat(frames[:4], axis=0)
result = pd.concat([filter_type_frames] + frames[4:], axis=1)
result = result.reset_index('Time')
print(result)
Calling pd.concat
with axis=0
concatenates vertically, with axis=1
horizontally.
Since pd.concat
accepts a list of DataFrames and can concatenate them all at once without iteratively creating intermediate DataFrames, pd.concat
avoids the quadratic copying problem.
Since pd.concat
aligns the indexes, by setting the index to Time
, the data gets aligned properly based on Time
.
See below for a runnable example.
There is another way to solve the problem, and in a way it is prettier, but it calls pd.merge
in a loop and therefore it may suffer from poor performance for the reason explained above.
The idea, however, is this: By default, pd.merge(left, right)
merges on all column labels that left
and right
share in common. So if you omit on='Offset'
(or `on='Time'?) and use
df2 = df2.merge(df_, how='outer')
then the merge will join on both Offset
(or Time
) and Filter_type
if both exist.
You could further simplify the loop by using
import functools
df2 = functools.reduce(functools.partial(pd.merge, how='outer'), df['Frames'])
The loop is hidden inside functools.reduce
, but in essence, pd.merge
is still being called in a loop. So while this is pretty, it may not be performant.
import functools
import pandas as pd
frames = [pd.DataFrame({'Filter_type': [0.0], 'Time': [6333.137168]}),
pd.DataFrame({'Filter_type': [0.0], 'Time': [6347.422576]}),
pd.DataFrame({'Filter_type': [0.0], 'Time': [7002.406185]}),
pd.DataFrame({'Filter_type': [0.0], 'Time': [7015.845717]}),
pd.DataFrame({'Sign_pos_X': [11.5, 25.0, 25.5, 38.0],
'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
pd.DataFrame({'Sign_pos_Y': [-3.0, 8.0, -7.5, -0.5],
'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
pd.DataFrame({'Sign_pos_Z': [1.0, 1.0, 1.0, 7.5],
'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
pd.DataFrame({'Supplementary_sign_type': [0.0, 0.0, 0.0, 0.0],
'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
pd.DataFrame({'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717],
'vision_only_sign_type': [7.0, 9.0, 9.0, 35.0]})]
result = functools.reduce(functools.partial(pd.merge, how='outer'), frames)
print(result)
frames = [df.set_index('Time') for df in frames]
A = pd.concat(frames[:4], axis=0)
result = pd.concat([A] + frames[4:], axis=1)
result = result.reset_index('Time')
print(result)
# same result
prints
Filter_type Time Sign_pos_X Sign_pos_Y Sign_pos_Z \
0 0.0 6333.137168 11.5 -3.0 1.0
1 0.0 6347.422576 25.0 8.0 1.0
2 0.0 7002.406185 25.5 -7.5 1.0
3 0.0 7015.845717 38.0 -0.5 7.5
Supplementary_sign_type vision_only_sign_type
0 0.0 7.0
1 0.0 9.0
2 0.0 9.0
3 0.0 35.0
Upvotes: 2