Matthias
Matthias

Reputation: 5764

Pandas: join dataframes and merge values of identical columns

I got nine different dataframes that I want to join (or merge, or update) into one single dataframe. Each of those original dataframes consists of two columns only, an in seconds and value for that observation. The data looks like this:

   Filter_type         Time
0          0.0  6333.137168


   Filter_type         Time
0          0.0  6347.422576


   Filter_type         Time
0          0.0  7002.406185


   Filter_type         Time
0          0.0  7015.845717


   Sign_pos_X         Time
0        11.5  6333.137168
1        25.0  6347.422576
2        25.5  7002.406185
3        38.0  7015.845717


   Sign_pos_Y         Time
0        -3.0  6333.137168
1         8.0  6347.422576
2        -7.5  7002.406185
3        -0.5  7015.845717


   Sign_pos_Z         Time
0         1.0  6333.137168
1         1.0  6347.422576
2         1.0  7002.406185
3         7.5  7015.845717


   Supplementary_sign_type         Time
0                      0.0  6333.137168
1                      0.0  6347.422576
2                      0.0  7002.406185
3                      0.0  7015.845717


          Time  vision_only_sign_type
0  6333.137168                    7.0
1  6347.422576                    9.0
2  7002.406185                    9.0
3  7015.845717                   35.0

Since I want to join all of them into one single dataframe, I tried the following:

df2 = None

for cell in df['Frames']:
    if not isinstance(cell, list):
        continue

    df_ = pd.DataFrame(cell)
    if df2 is None:
        # first iteration
        df2 = df_
        continue

    df2 = df2.merge(df_, on='Offset', how='outer') 
    #df2 = df2.join(df_)
    #df2.update(df_, join='outer')

df2

The problem is, that the first four dataframes have the same name of the value column while the others don't. Therefore the result has three columns with the prefix 'Filter_type':

+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+
|    |   Filter_type_x |   Offset |   Filter_type_y |   Filter_type_x |   Filter_type_y |   Sign_pos_X |   Sign_pos_Y |   Sign_pos_Z |   Supplementary_sign_type |   vision_only_sign_type |
|----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------|
|  0 |               0 |  6333.14 |             nan |             nan |             nan |         11.5 |         -3   |          1   |                         0 |                       7 |
|  1 |             nan |  6347.42 |               0 |             nan |             nan |         25   |          8   |          1   |                         0 |                       9 |
|  2 |             nan |  7002.41 |             nan |               0 |             nan |         25.5 |         -7.5 |          1   |                         0 |                       9 |
|  3 |             nan |  7015.85 |             nan |             nan |               0 |         38   |         -0.5 |          7.5 |                         0 |                      35 |
+----+-----------------+----------+-----------------+-----------------+-----------------+--------------+--------------+--------------+---------------------------+-------------------------+

My question is: How can I force the merge/join to join all the 'Filter_type' columns into one. You can see that each row has only one value in all of such columns while the others are NaN. Result should look like this (having only one merged column 'Filter_type'):

+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+
|    |   Offset |   Sign_pos_X |   Sign_pos_Y |   Sign_pos_Z |   Supplementary_sign_type |   vision_only_sign_type |   Filter_type |
|----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------|
|  0 |  6333.14 |         11.5 |         -3   |          1   |                         0 |                       7 |             0 |
|  1 |  6347.42 |         25   |          8   |          1   |                         0 |                       9 |             0 |
|  2 |  7002.41 |         25.5 |         -7.5 |          1   |                         0 |                       9 |             0 |
|  3 |  7015.85 |         38   |         -0.5 |          7.5 |                         0 |                      35 |             0 |
+----+----------+--------------+--------------+--------------+---------------------------+-------------------------+---------------+

Upvotes: 1

Views: 2036

Answers (1)

unutbu
unutbu

Reputation: 880907

Calling pd.merge in a loop leads to quadratic copying and slow performance when the length or sheer number of DataFrames is large. So avoid this if possible.

Here, it appears that we want to concatenate the DataFrames vertically when they have Time and Filter_type columns, and we wish to concatenate horizontally when the DataFrames lack a Filter_type column:

frames = [df.set_index('Time') for df in frames]
filter_type_frames = pd.concat(frames[:4], axis=0)
result = pd.concat([filter_type_frames] + frames[4:], axis=1)
result = result.reset_index('Time')
print(result)

Calling pd.concat with axis=0 concatenates vertically, with axis=1 horizontally. Since pd.concat accepts a list of DataFrames and can concatenate them all at once without iteratively creating intermediate DataFrames, pd.concat avoids the quadratic copying problem.

Since pd.concat aligns the indexes, by setting the index to Time, the data gets aligned properly based on Time.

See below for a runnable example.


There is another way to solve the problem, and in a way it is prettier, but it calls pd.merge in a loop and therefore it may suffer from poor performance for the reason explained above.

The idea, however, is this: By default, pd.merge(left, right) merges on all column labels that left and right share in common. So if you omit on='Offset' (or `on='Time'?) and use

df2 = df2.merge(df_, how='outer') 

then the merge will join on both Offset (or Time) and Filter_type if both exist.


You could further simplify the loop by using

import functools
df2 = functools.reduce(functools.partial(pd.merge, how='outer'), df['Frames'])

The loop is hidden inside functools.reduce, but in essence, pd.merge is still being called in a loop. So while this is pretty, it may not be performant.


import functools
import pandas as pd
frames = [pd.DataFrame({'Filter_type': [0.0], 'Time': [6333.137168]}),
          pd.DataFrame({'Filter_type': [0.0], 'Time': [6347.422576]}),
          pd.DataFrame({'Filter_type': [0.0], 'Time': [7002.406185]}),
          pd.DataFrame({'Filter_type': [0.0], 'Time': [7015.845717]}),
          pd.DataFrame({'Sign_pos_X': [11.5, 25.0, 25.5, 38.0],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Sign_pos_Y': [-3.0, 8.0, -7.5, -0.5],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Sign_pos_Z': [1.0, 1.0, 1.0, 7.5],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Supplementary_sign_type': [0.0, 0.0, 0.0, 0.0],
                        'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717]}),
          pd.DataFrame({'Time': [6333.137168, 6347.422576, 7002.406185, 7015.845717],
                        'vision_only_sign_type': [7.0, 9.0, 9.0, 35.0]})]

result = functools.reduce(functools.partial(pd.merge, how='outer'), frames)
print(result)

frames = [df.set_index('Time') for df in frames]
A = pd.concat(frames[:4], axis=0)
result = pd.concat([A] + frames[4:], axis=1)
result = result.reset_index('Time')
print(result)
# same result

prints

   Filter_type         Time  Sign_pos_X  Sign_pos_Y  Sign_pos_Z  \
0          0.0  6333.137168        11.5        -3.0         1.0   
1          0.0  6347.422576        25.0         8.0         1.0   
2          0.0  7002.406185        25.5        -7.5         1.0   
3          0.0  7015.845717        38.0        -0.5         7.5   

   Supplementary_sign_type  vision_only_sign_type  
0                      0.0                    7.0  
1                      0.0                    9.0  
2                      0.0                    9.0  
3                      0.0                   35.0  

Upvotes: 2

Related Questions