Reputation: 2256
I have the feeling that there has to be a pretty straight-forward way to do this, but I can't find it.
So, I have this data (notice that description
column has a shared part between several):
import pandas as pd
data = {"description": ["AAAA:A", "AAAA:B", "AAAA:C", "AAAA:D", "BBBB:A", "BBBB:B"],
"sequence": ["AAAAAAAAAAA", "AAAAAAABBBBBB", "AAAAAAAACCCCCCC", "AAAAAAAADDDDDDD",
"BBBBBBAAAAA", "BBBBBBBBBBBBB"]}
df = pd.DataFrame(data)
print df
# description sequence
#0 AAAA:A AAAAAAAAAAA
#1 AAAA:B AAAAAAABBBBBB
#2 AAAA:C AAAAAAAACCCCCCC
#3 AAAA:D AAAAAAAADDDDDDD
#4 BBBB:A BBBBBBAAAAA
#5 BBBB:B BBBBBBBBBBBBB
And my final aim is to put together all the sequences for a single 4 letter description. Something like this:
# description sequence_A sequence_B sequence_C sequence_D
#0 AAAA AAAAAAAAAAA AAAAAAABBBBBB AAAAAAAACCCCCCC AAAAAAAADDDDDDD
#1 BBBB BBBBBBAAAAA BBBBBBBBBBBBB NaN NaN
Right until now, I've gotten to this point:
df = df.apply(lambda row: pd.Series({"description": row["description"].split(":")[0],
"sequence_{}".format(row["description"].split(":")[1]): row["sequence"]}),
axis=1)
print df
# description sequence_A sequence_B sequence_C sequence_D
#0 AAAA AAAAAAAAAAA NaN NaN NaN
#1 AAAA NaN AAAAAAABBBBBB NaN NaN
#2 AAAA NaN NaN AAAAAAAACCCCCCC NaN
#3 AAAA NaN NaN NaN AAAAAAAADDDDDDD
#4 BBBB BBBBBBAAAAA NaN NaN NaN
#5 BBBB NaN BBBBBBBBBBBBB NaN NaN
I'm guessing that I need to df.groupby("description")
and then do one more step there, but I'm missing that last point.
Upvotes: 2
Views: 1578
Reputation: 38415
Use pivot
df[['description', 'seq']] = df['description'].str.split(':', expand = True)
df.pivot('description', 'seq', 'sequence').add_prefix('Sequence_')
seq Sequence_A Sequence_B Sequence_C Sequence_D
description
AAAA AAAAAAAAAAA AAAAAAABBBBBB AAAAAAAACCCCCCC AAAAAAAADDDDDDD
BBBB BBBBBBAAAAA BBBBBBBBBBBBB None None
Upvotes: 3
Reputation: 323276
split
then pivot
df[['New1','New2']]=df.description.str.split(':',expand=True)
s=df[['New1','New2','sequence']]
s.pivot(*s.columns).add_prefix('sequence_')
Out[863]:
New2 sequence_A sequence_B sequence_C sequence_D
New1
AAAA AAAAAAAAAAA AAAAAAABBBBBB AAAAAAAACCCCCCC AAAAAAAADDDDDDD
BBBB BBBBBBAAAAA BBBBBBBBBBBBB None None
Upvotes: 3