OverflowingTheGlass
OverflowingTheGlass

Reputation: 2434

Pandas Create Link Pairs from Multiple Rows

I have a df with id as the flow id, dttm as the step modification time, and step as the steps in the flow. It is currently ordered by dttm. There could be any number of steps for a particular id.

Current df:

    id      dttm                  step
0   81      2015-05-26 07:56:03   A
1   81      2015-05-26 08:19:07   B
2   81      2015-05-26 08:32:05   C
3   91      2015-05-26 08:07:12   B
4   91      2015-05-26 08:07:12   C

I want to create link data to feed into a Sankey. Therefore I ultimately want a df with three columns: source, target, and value. value is the count of ids that have such a step pair.

Desired df:

    source   target   value
0   A        B        1
1   B        C        2

I know I can stuff step into one row with either groupby or possibly cat. However, I think that would just create a different starting point without actually advancing a solution. Part of what makes it tough is that the steps depend on dttm to stay ordered so that the steps are appropriately paired. Also, the fact that it has to be dynamic because there could be any number of steps adds to the difficulty.

How should I dynamically "stuff" the step column to arrive at the link data?

Is there a way to join the df to itself to get all of the pairs, and then remove rows that are created during the join but are nonsense?

Thank you for any and all insight!

Upvotes: 4

Views: 2140

Answers (2)

OverflowingTheGlass
OverflowingTheGlass

Reputation: 2434

I figured out a solution. It is not a one step solution like Scott's, but I think it's a little simpler to understand. also, it runs much, much faster:

Add a sequence to each group of id. then duplicate the df to allow a self-merge:

df['sequence'] = df.groupby('id').cumcount()
df2 = df

Merge the dataframes on id, then only keep the rows that make sense based on the sequence:

result = df.merge(df2, on='id', how='left')
result = result[result['sequence_x']+1==result['sequence_y']]

Group the resulting df by step pairs and use size to get value. Then reset the index and rename the columns:

result = pd.DataFrame(result.groupby(['step_x','step_y']).size().rename('value'))
result = result.reset_index()
result.columns = ['source','target','value']

Upvotes: 4

Scott Boston
Scott Boston

Reputation: 153460

Let's try:

(df.groupby('id')['step'].apply(lambda x: pd.DataFrame(list(zip(x, x[1:]))).set_index([0,1]).assign(count=1))
   .rename_axis(['id','source','target'])
   .sum(level=[1,2]).reset_index())

Output:

  source target  count
0      A      B      1
1      B      C      2

Upvotes: 1

Related Questions