Reputation: 508
I am merging several data frames into one data frame in a for loop. something like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(data=np.random.randint(0,100,(2,5)),columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0,100,(2,5)),columns=list('GHABC'))
df1 = df1.merge(df2, how='outer', left_index=True, right_index=True,suffixes=('', '_' + 'second'))
Several columns from each of these data frames have similar names so each time there is such a column I am adding a suffix. It is becoming a mess. In addition, I would like to be able to quickly access all the columns from the first table, second one, etc. Is there a way to merge these columns but keep them as a group? So that I don't need to change the column name and I can access all the columns of each data set easier?
Upvotes: 0
Views: 447
Reputation: 13697
Here is a way to merge the dataframes horizonthally with the help of MultiIndex, which has some advantages comparing to vertical merge. For example, you won't have lots of NaN
fields, and the dtype's won't change from int
to float
like in the case with horizonthal merge.
import numpy as np
import pandas as pd
df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
columns=list('GHABC'))
dfs = [df1, df2]
result = pd.concat(dfs, axis=1, keys=range(len(dfs)))
print(result)
This will give:
0 1
A B C D E G H A B C
0 41 49 13 36 57 28 12 82 18 67
1 72 91 34 17 12 6 67 98 36 25
You can access each group in a loop:
for source_index, df in result.groupby(axis=1, level=0):
print(df)
0
A B C D E
0 41 49 13 36 57
1 72 91 34 17 12
1
G H A B C
0 28 12 82 18 67
1 6 67 98 36 25
or individually:
gb = result.groupby(axis=1, level=0)
first_group = gb.get_group(0)
print(first_group)
0
A B C D E
0 41 49 13 36 57
1 72 91 34 17 12
References:
Upvotes: 1
Reputation: 13697
Here is an illustration of what @QuangHoang proposed in their comment:
import pandas as pd
df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
columns=list('GHABC'))
dfs = [df1, df2]
for source_index, df in enumerate(dfs):
df['data_source'] = source_index
result = pd.concat(dfs)
print(result)
will give all your dataframes joined vertically:
A B C D E G H data_source
0 66 52 16 73.0 59.0 NaN NaN 0
1 73 64 59 31.0 13.0 NaN NaN 0
0 72 79 45 NaN NaN 30.0 0.0 1
1 45 52 40 NaN NaN 2.0 80.0 1
To access each group in a loop you can do:
for source_index, df in result.groupby('data_source'):
print(df.dropna(axis=1))
A B C D E data_source
0 66 52 16 73.0 59.0 0
1 73 64 59 31.0 13.0 0
A B C G H data_source
0 72 79 45 30.0 0.0 1
1 45 52 40 2.0 80.0 1
or by an index of the source:
gb = result.groupby('data_source')
source_index = 0
first_group = gb.get_group(source_index).dropna(axis=1)
print(first_group)
A B C D E data_source
0 66 52 16 73.0 59.0 0
1 73 64 59 31.0 13.0 0
Upvotes: 0