Reputation: 2434
I have a df with id
as the flow id, dttm
as the step
modification time, and step
as the steps in the flow. It is currently ordered by dttm
. There could be any number of steps for a particular id
.
Current df:
id dttm step
0 81 2015-05-26 07:56:03 A
1 81 2015-05-26 08:19:07 B
2 81 2015-05-26 08:32:05 C
3 91 2015-05-26 08:07:12 B
4 91 2015-05-26 08:07:12 C
I want to create link data to feed into a Sankey. Therefore I ultimately want a df with three columns: source
, target
, and value
. value
is the count of id
s that have such a step pair.
Desired df:
source target value
0 A B 1
1 B C 2
I know I can stuff step
into one row with either groupby
or possibly cat
. However, I think that would just create a different starting point without actually advancing a solution. Part of what makes it tough is that the steps depend on dttm
to stay ordered so that the steps are appropriately paired. Also, the fact that it has to be dynamic because there could be any number of steps adds to the difficulty.
How should I dynamically "stuff" the step
column to arrive at the link data?
Is there a way to join the df to itself to get all of the pairs, and then remove rows that are created during the join but are nonsense?
Thank you for any and all insight!
Upvotes: 4
Views: 2140
Reputation: 2434
I figured out a solution. It is not a one step solution like Scott's, but I think it's a little simpler to understand. also, it runs much, much faster:
Add a sequence to each group of id
. then duplicate the df to allow a self-merge:
df['sequence'] = df.groupby('id').cumcount()
df2 = df
Merge the dataframes on id
, then only keep the rows that make sense based on the sequence:
result = df.merge(df2, on='id', how='left')
result = result[result['sequence_x']+1==result['sequence_y']]
Group the resulting df by step pairs and use size
to get value
. Then reset the index and rename the columns:
result = pd.DataFrame(result.groupby(['step_x','step_y']).size().rename('value'))
result = result.reset_index()
result.columns = ['source','target','value']
Upvotes: 4
Reputation: 153460
Let's try:
(df.groupby('id')['step'].apply(lambda x: pd.DataFrame(list(zip(x, x[1:]))).set_index([0,1]).assign(count=1))
.rename_axis(['id','source','target'])
.sum(level=[1,2]).reset_index())
Output:
source target count
0 A B 1
1 B C 2
Upvotes: 1