Simd
Simd

Reputation: 21333

Pandas groupby where all columns are added to a list prefixed by column name

I have a CSV file that looks like this:

id1,feat1,feat2,feat3
a,b,asd,asg
c,d,dg,ag
a,e,sdg,as
c,f,as,sdg
c,g,adg,sd

I read it in to a dataframe with df = pd.read_csv("file.csv").

I would like group by id1 and combine all the other columns in one line in the group with the header name added as a prefix. That is the output should be a data frame that looks like:

a [feat1=b,feat1=e,feat2=asd,feat2=sdg,feat3=asg,feat3=as]
c [feat1=d,feat1=f,feat1=g,feat2=dg,feat2=as,feat2=adg,feat3=ag,feat3=sdg,feat3=sd]

df.groupby('id1') will start me off but I am not sure where to go from there.

What is a good way to do this?

Upvotes: 3

Views: 293

Answers (2)

EdChum
EdChum

Reputation: 394159

You can use a custom function and apply on the groupby object, the function calls apply again on the Series passed to zip the column names and values into a list, we then perform a list comprehension and return this inside a list as desired:

In [54]:    
def foo(x):
    l = (x.apply(lambda x: x.name + '=' + x)).values.tolist()
    return pd.Series([[i for j in l for i in j]])
​
gp = df.groupby('id1')[['feat1','feat2','feat3']]
gp1 = gp.apply(foo)
gp1

Out[54]:
                                                     0
id1                                                   
a    [feat1=b, feat2=asd, feat3=asg, feat1=e, feat2...
c    [feat1=d, feat2=dg, feat3=ag, feat1=f, feat2=a...

if we look at the contents we see that we have a list of the values:

In [55]:    
gp1.iloc[0].values

Out[55]:
array([['feat1=b', 'feat2=asd', 'feat3=asg', 'feat1=e', 'feat2=sdg', 'feat3=as']], dtype=object)

Upvotes: 2

IanS
IanS

Reputation: 16251

Applying this function will work:

def func(dfg):
    dfu = dfg.unstack()
    result = dfu.index.get_level_values(0) + '=' + dfu.values
    return result.tolist()

df.groupby('id1').apply(func)

Explanation: let's consider one group, for instance dfg = df[df['id1'] == 'c'].

dfg.unstack()
Out[35]: 
id1    1      c
       3      c
       4      c
feat1  1      d
       3      f
       4      g
feat2  1     dg
       3     as
       4    adg
feat3  1     ag
       3    sdg
       4     sd

By unstacking you get the values aligned with the column names (ignore the index values in between). All you need to do is concatenate:

dfu.index.get_level_values(0) + '=' + dfu.values
Out[36]: 
Index(['feat1=d', 'feat1=f', 'feat1=g', 'feat2=dg', 'feat2=as', 'feat2=adg',
       'feat3=ag', 'feat3=sdg', 'feat3=sd'],
      dtype='object')

Finally, convert to list before returning, otherwise you end up with index objects.

Upvotes: 1

Related Questions