Reputation: 710
Starting from a dataframe, I want to build a new dataframe that adds an extra column each time an index value reappears, without knowing in advance how many columns I will need:
pd.DataFrame([["John","guitar"],["Michael","football"],["Andrew","running"],["John","dancing"],["Andrew","cars"]])
and I want :
pd.DataFrame([["John","guitar","dancing"],["Michael","football",None],["Andrew","running","cars"]])
without knowing how many columns I should create at the start.
Upvotes: 4
Views: 1799
Reputation: 3902
Assuming the column names are ['person', 'activity'], you can do:
df_out = df.groupby('person').agg(list).reset_index()
df_out = pd.concat([df_out, pd.DataFrame(df_out['activity'].values.tolist())], axis=1)
df_out = df_out.drop(columns='activity')  # positional axis argument no longer works in recent pandas
giving you
    person         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football     None
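For reproducibility, a minimal construction of the assumed input df (the column names 'person' and 'activity' are this answer's assumption, not given in the question):
import pandas as pd

# same data as in the question, labelled with the assumed column names
df = pd.DataFrame([["John", "guitar"], ["Michael", "football"], ["Andrew", "running"],
                   ["John", "dancing"], ["Andrew", "cars"]],
                  columns=['person', 'activity'])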
Upvotes: 0
Reputation: 862511
Use GroupBy.cumcount to get a counter per group and then reshape with unstack:
df1 = pd.DataFrame([["John","guitar"],
                    ["Michael","football"],
                    ["Andrew","running"],
                    ["John","dancing"],
                    ["Andrew","cars"]], columns=['a','b'])
         a         b
0     John    guitar
1  Michael  football
2   Andrew   running
3     John   dancing
4   Andrew      cars
df = (df1.set_index(['a', df1.groupby('a').cumcount()])['b']
         .unstack()
         .rename_axis(-1)
         .reset_index()
         .rename(columns=lambda x: x+1))
print(df)
         0         1        2
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football      NaN
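For clarity, the cumcount counter numbers each row within its group, and that number becomes the new column position; a quick check on df1 from above:
# each row is labelled with its occurrence number within its 'a' group
print(df1.groupby('a').cumcount())
0    0
1    0
2    0
3    1
4    1
dtype: int64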
Or aggregate lists per group and build the new DataFrame with the constructor:
s = df1.groupby('a')['b'].agg(list)
df = pd.DataFrame(s.values.tolist(), index=s.index).reset_index()
print(df)
         a         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football     None
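For reference, the intermediate Series s holds one list per group and looks roughly like this (alignment approximate):
print(s)
a
Andrew        [running, cars]
John        [guitar, dancing]
Michael            [football]
Name: b, dtype: object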
Upvotes: 1
Reputation: 88226
df = pd.DataFrame([["John","guitar"],["Michael","football"],["Andrew","running"],["John","dancing"],["Andrew","cars"]], columns = ['person','hobby'])
You can groupby person and take the unique values of hobby, then use .apply(pd.Series) to expand the lists into columns:
df.groupby('person').hobby.unique().apply(pd.Series).reset_index()
    person         0        1
0   Andrew   running     cars
1     John    guitar  dancing
2  Michael  football      NaN
For a large dataframe, try this more efficient alternative:
df = df.groupby('person').hobby.unique()
df = pd.DataFrame(df.values.tolist(), index=df.index).reset_index()
This in essence does the same, but avoids looping over rows when applying pd.Series.
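If you then want descriptive names for the unknown number of columns instead of 0, 1, ..., you can rename them dynamically (the hobby_ prefix here is just an illustrative choice):
# rename the numbered columns to hobby_0, hobby_1, ... for however many were created
df.columns = ['person'] + [f'hobby_{i}' for i in range(df.shape[1] - 1)]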
Upvotes: 6