Reputation: 33
What is a systematic way to go from this:
x = {'col0': [1, 1, 2, 2], 'col1': ['a', 'b', 'a', 'b'],
'col2': ['x', 'x', 'x', 'x'], 'col3': [12, 13, 14, 15]}
y = pd.DataFrame(data=x)
y
col0 col1 col2 col3
0 1 a x 12
1 1 b x 13
2 2 a x 14
3 2 b x 15
To this:
y2
col0 col3__a_x col3__b_x
0 1 12 13
1 2 14 15
I was initially thinking something like cast
from the reshape2 package from R. However, I'm much less familiar with Pandas/Python than I am with R.
In the dataset I'm working with col1
has 3 different values, col2
is all the same value, ~200,000 rows, and ~80 other columns that would get the suffix added.
Upvotes: 3
Views: 1002
Reputation: 294488
Using pd.factorize
and Numpy slice assignment we can construct the data frame we need.
i, r = pd.factorize(df.col0)
j, c = pd.factorize(df.col1.str.cat(df.col2, '_'))
b = np.zeros((r.size, c.size), np.int64)
b[i, j] = df.col3.values
d = pd.DataFrame(
np.column_stack([r, b]),
columns=['col0'] + ['col3__' + col for col in c]
)
d
col0 col3__a_x col3__b_x
0 1 12 13
1 2 14 15
Upvotes: 4
Reputation: 153500
You can do it using set_index
and unstack
if you don't have multiple values for resulting rows and columns otherwise you'll have to use a aggregation method such as pivot_table or groupby:
df_out = y.set_index(['col0','col1','col2']).unstack([1,2])
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()
Output:
col0 col3_a_x col3_b_x
0 1 12 13
1 2 14 15
Or with multiple values using groupby:
df_out = y.groupby(['col0','col1','col2']).mean().unstack([1,2])
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()
Upvotes: 4
Reputation: 51395
I think that @Wen 's solution is probably better, as it is pure pandas
, but here is another solution if you want to use numpy
:
import numpy as np
d = y.groupby('col0').apply(lambda x: x['col3']).unstack().values
d = d[~np.isnan(d)].reshape(len(d),-1)
new_df = pd.DataFrame(d).reset_index().rename(columns={'index': 'col0', 0: 'col3_a_x', 1:'col3_b_x'})
>>> new_df
col0 col3_a_x col3_b_x
0 0 12.0 13.0
1 1 14.0 15.0
Upvotes: 2
Reputation: 323316
You will need pviot
and column faltten
s=pd.pivot_table(y,index='col0',columns=['col1','col2'],values='col3')
s.columns=s.columns.map('_'.join)
s.add_prefix('col3_').reset_index()
Out[1383]:
col0 col3_a_x col3_b_x
0 1 12 13
1 2 14 15
Upvotes: 6