Reputation: 21333
I have a CSV file that looks like this:
id1,feat1,feat2,feat3
a,b,asd,asg
c,d,dg,ag
a,e,sdg,as
c,f,as,sdg
c,g,adg,sd
I read it in to a dataframe with df = pd.read_csv("file.csv")
.
I would like group by id1
and combine all the other columns in one line in the group with the header name added as a prefix. That is the output should be a data frame that looks like:
a [feat1=b,feat1=e,feat2=asd,feat2=sdg,feat3=asg,feat3=as]
c [feat1=d,feat1=f,feat1=g,feat2=dg,feat2=as,feat2=adg,feat3=ag,feat3=sdg,feat3=sd]
df.groupby('id1')
will start me off but I am not sure where to go from there.
What is a good way to do this?
Upvotes: 3
Views: 293
Reputation: 394159
You can use a custom function and apply
on the groupby
object, the function calls apply
again on the Series passed to zip the column names and values into a list, we then perform a list comprehension and return this inside a list as desired:
In [54]:
def foo(x):
l = (x.apply(lambda x: x.name + '=' + x)).values.tolist()
return pd.Series([[i for j in l for i in j]])
gp = df.groupby('id1')[['feat1','feat2','feat3']]
gp1 = gp.apply(foo)
gp1
Out[54]:
0
id1
a [feat1=b, feat2=asd, feat3=asg, feat1=e, feat2...
c [feat1=d, feat2=dg, feat3=ag, feat1=f, feat2=a...
if we look at the contents we see that we have a list of the values:
In [55]:
gp1.iloc[0].values
Out[55]:
array([['feat1=b', 'feat2=asd', 'feat3=asg', 'feat1=e', 'feat2=sdg', 'feat3=as']], dtype=object)
Upvotes: 2
Reputation: 16251
Applying this function will work:
def func(dfg):
dfu = dfg.unstack()
result = dfu.index.get_level_values(0) + '=' + dfu.values
return result.tolist()
df.groupby('id1').apply(func)
Explanation: let's consider one group, for instance dfg = df[df['id1'] == 'c']
.
dfg.unstack()
Out[35]:
id1 1 c
3 c
4 c
feat1 1 d
3 f
4 g
feat2 1 dg
3 as
4 adg
feat3 1 ag
3 sdg
4 sd
By unstacking you get the values aligned with the column names (ignore the index values in between). All you need to do is concatenate:
dfu.index.get_level_values(0) + '=' + dfu.values
Out[36]:
Index(['feat1=d', 'feat1=f', 'feat1=g', 'feat2=dg', 'feat2=as', 'feat2=adg',
'feat3=ag', 'feat3=sdg', 'feat3=sd'],
dtype='object')
Finally, convert to list before returning, otherwise you end up with index objects.
Upvotes: 1