Reputation: 9336
Suppose I have the following table in a csv.
elementID | groupID | sequence
abc | A | 0
dcv | A | 1
asd | B | 1
ccc | B | 0
abc | B | 2
I imported this table in Pandas
as a DataFrame.
From this DataFrame I need to create another one where each entry represents two two sequential elements of elementID
.
For example something like the following
elementID_1 | elementID_2 | groupID
abc | dcv | A
asd | abc | B
ccc | asd | B
At first I thought of a multi-row selection but could not find out how to do that.
So I decided to do step by step creating one column per time.
I created first the column elementID_2
as follows
new_df = pd.DataFrame()
new_df[["elementID_2", "sequence", "groupID"]] = old_df.loc[old_df.loc.sequence>1][["elementID", "sequence", "groupID"]]
Now to create the elementID_1
column I would like to match the entries of the old dataframe having assequence
value, the sequence
value of the new DataFrame minus 1 and the same groupdID
between the two DataFrame.
How can I do that?
Upvotes: 1
Views: 1309
Reputation: 150735
IIUC, you can do so by shift()
on groups and dropna
:
df.sort_values('sequence', inplace=True)
df['elementID_1'] = df.groupby('groupID').elementID.shift()
df.dropna()
Output:
+----+-------------+-----------+------------+---------------+
| | elementID | groupID | sequence | elementID_1 |
|----+-------------+-----------+------------+---------------|
| 1 | dcv | A | 1 | abc |
| 2 | asd | B | 3 | ccc |
| 4 | abc | B | 4 | asd |
+----+-------------+-----------+------------+---------------+
Then you can drop sequence
column and rename elementID
to elementID_2
.
Upvotes: 1
Reputation: 323226
Here is one way using groupby
+ itertools
s=df.groupby('groupID').elementID.apply(lambda x : set(itertools.combinations(x.values.tolist(), 2)))
df1=pd.DataFrame(s.index.repeat(s.str.len()))
df2=pd.DataFrame(list(itertools.chain(*s.tolist())))
pd.concat([df1,df2],axis=1)
Out[286]:
groupID 0 1
0 A abc dcv
1 B asd abc
2 B asd ccc
3 B ccc abc
Upvotes: 2