jaumebonet
jaumebonet

Reputation: 2256

Collapse rows of python dataframe that share one column's value

I have the feeling that there has to be a pretty straight-forward way to do this, but I can't find it.

So, I have this data (notice that description column has a shared part between several):

import pandas as pd

data = {"description": ["AAAA:A", "AAAA:B", "AAAA:C", "AAAA:D", "BBBB:A", "BBBB:B"],
        "sequence": ["AAAAAAAAAAA", "AAAAAAABBBBBB", "AAAAAAAACCCCCCC", "AAAAAAAADDDDDDD",
                     "BBBBBBAAAAA", "BBBBBBBBBBBBB"]}

df = pd.DataFrame(data)
print df

#  description         sequence
#0      AAAA:A      AAAAAAAAAAA
#1      AAAA:B    AAAAAAABBBBBB
#2      AAAA:C  AAAAAAAACCCCCCC
#3      AAAA:D  AAAAAAAADDDDDDD
#4      BBBB:A      BBBBBBAAAAA
#5      BBBB:B    BBBBBBBBBBBBB

And my final aim is to put together all the sequences for a single 4 letter description. Something like this:

#  description   sequence_A     sequence_B       sequence_C       sequence_D
#0        AAAA  AAAAAAAAAAA  AAAAAAABBBBBB  AAAAAAAACCCCCCC  AAAAAAAADDDDDDD
#1        BBBB  BBBBBBAAAAA  BBBBBBBBBBBBB              NaN              NaN

Right until now, I've gotten to this point:

df = df.apply(lambda row: pd.Series({"description": row["description"].split(":")[0],
                                     "sequence_{}".format(row["description"].split(":")[1]): row["sequence"]}),
              axis=1)
print df

#  description   sequence_A     sequence_B       sequence_C       sequence_D
#0        AAAA  AAAAAAAAAAA            NaN              NaN              NaN
#1        AAAA          NaN  AAAAAAABBBBBB              NaN              NaN
#2        AAAA          NaN            NaN  AAAAAAAACCCCCCC              NaN
#3        AAAA          NaN            NaN              NaN  AAAAAAAADDDDDDD
#4        BBBB  BBBBBBAAAAA            NaN              NaN              NaN
#5        BBBB          NaN  BBBBBBBBBBBBB              NaN              NaN

I'm guessing that I need to df.groupby("description") and then do one more step there, but I'm missing that last point.

Upvotes: 2

Views: 1578

Answers (2)

Vaishali
Vaishali

Reputation: 38415

Use pivot

df[['description', 'seq']] = df['description'].str.split(':', expand = True)
df.pivot('description', 'seq', 'sequence').add_prefix('Sequence_')


seq     Sequence_A  Sequence_B      Sequence_C      Sequence_D
description             
AAAA    AAAAAAAAAAA AAAAAAABBBBBB   AAAAAAAACCCCCCC AAAAAAAADDDDDDD
BBBB    BBBBBBAAAAA BBBBBBBBBBBBB   None            None

Upvotes: 3

BENY
BENY

Reputation: 323276

split then pivot

df[['New1','New2']]=df.description.str.split(':',expand=True)
s=df[['New1','New2','sequence']]

s.pivot(*s.columns).add_prefix('sequence_')

Out[863]: 
New2   sequence_A     sequence_B       sequence_C       sequence_D
New1                                                              
AAAA  AAAAAAAAAAA  AAAAAAABBBBBB  AAAAAAAACCCCCCC  AAAAAAAADDDDDDD
BBBB  BBBBBBAAAAA  BBBBBBBBBBBBB             None             None

Upvotes: 3

Related Questions