Reputation: 2391
I've got two input dataframes
df1 (note, this DF could have more columns of data)
Sample Animal Time Sex
0 1 A one male
1 2 A two male
2 3 B one female
3 4 C one male
4 5 D one female
and df2
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
and I'd like to combine them so that I get the following:
one_a one_b one_c two_a two_b two_c Sex
Animal
A 0.2 0.4 0.3 0.5 0.7 0.2 male
B 0.4 0.1 0.9 NaN NaN NaN female
C 0.4 0.2 0.3 NaN NaN NaN male
D 0.6 0.2 0.4 NaN NaN NaN female
This is how I'm doing things:
df2.reset_index(inplace = True)
df3 = pd.melt(df2, id_vars=['Sample'], value_vars=list(cols))
df4 = pd.merge(df3, df1, on='Sample')
df4['moo'] = df4['Group'] + '_' + df4['variable']
df5 = pd.pivot_table(df4, values='value', index='Animal', columns='moo')
df6 = df1.groupby('Animal').agg('first')
pd.concat([df5, df6], axis=1).drop('Sample',1).drop('Group',1)
This works just fine, but could potentially be slow for large datasets. I'm wondering if any panda-pros see a better (read faster, more efficient)? I'm new to pandas and can imagine there are some shortcuts here that I don't know about.
Upvotes: 1
Views: 1864
Reputation: 24742
A few steps here. The key is that in order to create columns like one_a one_b .... two_c
, we need add Time
column to Sample
index to build a multi-level index and then unstack
to get the required form. Then, a groupby
on Animal
index is required to aggregate and reduce the number of NaN
s. The rest are just some manipulations on format.
import pandas as pd
# your data
# ==============================
# set index
df1 = df1.set_index('Sample')
print(df1)
Animal Time Sex
Sample
1 A one male
2 A two male
3 B one female
4 C one male
5 D one female
print(df2)
a b c
Sample
1 0.2 0.4 0.3
2 0.5 0.7 0.2
3 0.4 0.1 0.9
4 0.4 0.2 0.3
5 0.6 0.2 0.4
# processing
# =============================
df = df1.join(df2)
df_temp = df.set_index(['Animal', 'Sex','Time'], append=True).unstack()
print(df_temp)
a b c
Time one two one two one two
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
# rename the columns if you wish
df_temp.columns = ['{}_{}'.format(x, y) for x, y in zip(df_temp.columns.get_level_values(1), df_temp.columns.get_level_values(0))]
print(df_temp)
one_a two_a one_b two_b one_c two_c
Sample Animal Sex
1 A male 0.2 NaN 0.4 NaN 0.3 NaN
2 A male NaN 0.5 NaN 0.7 NaN 0.2
3 B female 0.4 NaN 0.1 NaN 0.9 NaN
4 C male 0.4 NaN 0.2 NaN 0.3 NaN
5 D female 0.6 NaN 0.2 NaN 0.4 NaN
result = df_temp.reset_index('Sex').groupby(level='Animal').agg(max).sort_index(axis=1)
print(result)
Sex one_a one_b one_c two_a two_b two_c
Animal
A male 0.2 0.4 0.3 0.5 0.7 0.2
B female 0.4 0.1 0.9 NaN NaN NaN
C male 0.4 0.2 0.3 NaN NaN NaN
D female 0.6 0.2 0.4 NaN NaN NaN
Upvotes: 1