Laleh
Laleh

Reputation: 508

Group by columns while merging data frames

I am merging several data frames into one data frame in a for loop. something like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data=np.random.randint(0,100,(2,5)),columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0,100,(2,5)),columns=list('GHABC'))
df1 = df1.merge(df2, how='outer', left_index=True, right_index=True,suffixes=('', '_' + 'second'))

Several columns from each of these data frames have similar names so each time there is such a column I am adding a suffix. It is becoming a mess. In addition, I would like to be able to quickly access all the columns from the first table, second one, etc. Is there a way to merge these columns but keep them as a group? So that I don't need to change the column name and I can access all the columns of each data set easier?

Upvotes: 0

Views: 447

Answers (2)

Georgy
Georgy

Reputation: 13697

Here is a way to merge the dataframes horizonthally with the help of MultiIndex, which has some advantages comparing to vertical merge. For example, you won't have lots of NaN fields, and the dtype's won't change from int to float like in the case with horizonthal merge.

import numpy as np
import pandas as pd


df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
result = pd.concat(dfs, axis=1, keys=range(len(dfs)))
print(result)

This will give:

    0                   1                
    A   B   C   D   E   G   H   A   B   C
0  41  49  13  36  57  28  12  82  18  67
1  72  91  34  17  12   6  67  98  36  25

You can access each group in a loop:

for source_index, df in result.groupby(axis=1, level=0):
    print(df)
    0                
    A   B   C   D   E
0  41  49  13  36  57
1  72  91  34  17  12
    1                
    G   H   A   B   C
0  28  12  82  18  67
1   6  67  98  36  25

or individually:

gb = result.groupby(axis=1, level=0)
first_group = gb.get_group(0)
print(first_group)
    0                
    A   B   C   D   E
0  41  49  13  36  57
1  72  91  34  17  12

References:

Upvotes: 1

Georgy
Georgy

Reputation: 13697

Here is an illustration of what @QuangHoang proposed in their comment:

import pandas as pd


df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
for source_index, df in enumerate(dfs):
    df['data_source'] = source_index
result = pd.concat(dfs)
print(result)

will give all your dataframes joined vertically:

    A   B   C     D     E     G     H  data_source
0  66  52  16  73.0  59.0   NaN   NaN            0
1  73  64  59  31.0  13.0   NaN   NaN            0
0  72  79  45   NaN   NaN  30.0   0.0            1
1  45  52  40   NaN   NaN   2.0  80.0            1

To access each group in a loop you can do:

for source_index, df in result.groupby('data_source'):
    print(df.dropna(axis=1))
    A   B   C     D     E  data_source
0  66  52  16  73.0  59.0            0
1  73  64  59  31.0  13.0            0
    A   B   C     G     H  data_source
0  72  79  45  30.0   0.0            1
1  45  52  40   2.0  80.0            1

or by an index of the source:

gb = result.groupby('data_source')
source_index = 0
first_group = gb.get_group(source_index).dropna(axis=1)
print(first_group)
    A   B   C     D     E  data_source
0  66  52  16  73.0  59.0            0
1  73  64  59  31.0  13.0            0

Upvotes: 0

Related Questions