Reputation: 43
I have a dataframe that looks like this:
id userid int1 int2 string string2 string3
1 90 5067 1000 aaa 100 qqq
1 90 6945 1000 bbb 101 qqq
1 90 9596 1010 ccc 102 qqq
The desired output:
id userid int1 int2 string string2 string3
1 90 [5067,6945,9596] [1000,1000,1010] [aaa,bbb,ccc] [100,101,102] qqq
I have tried to create a new dataframe for int1,int2,string,string2 and then merge them but this solution was very slow as I have a big dataset. A different solution is to use 'groupby':
df = df.groupby(['id','userid','string3']).[['int1'],['int2'],['string'],['string2']].apply(list).reset_index()
but this gives me this error:
AttributeError: 'Series' object has no attribute 'columns'
Any help is appreciated.
Upvotes: 2
Views: 45892
Reputation: 18367
You are passing individual values (lists) within a list. You don't need to use doble brackets like you do when filtering a dataframe and expect a dataframe output. Hopefully this will work better for you:
df = df.groupby(['id','userid','string3'])['int1','int2','string','string2'].apply(list).reset_index()
Upvotes: 0
Reputation: 30920
Use GroupBy.agg
with as_index=False
+ DataFrame.reindex
to return the columns in the initial order:
new_df=( df.groupby(['id','userid','string3'],as_index=False)
.agg(list)
.reindex(columns=df.columns) )
print(new_df)
If you want you could select the columns:
cols=['int1','int2','string','string2']
new_df=( df.groupby(['id','userid','string3'],as_index=False)[cols]
.agg(list)
.reindex(columns=df.columns) )
Output
id userid int1 int2 string \
0 1 90 [5067, 6945, 9596] [1000, 1000, 1010] [aaa, bbb, ccc]
string2 string3
0 [100, 101, 102] qqq
Upvotes: 3