Reputation: 195
I have a dataframe which looks like below:
user_id event_name event_params
10 start /pseudo
10 subcategory /home
10 add_basket_click /click
10 add_basket_error /event
10 end /end
11 start /pseudo
11 add_basket_click /click
11 add_basket_error /event
I want to swap rows for event_name, add_basket_click & add_basket_error. Currently add_basket_error comes after add_basket_click, I want it come before it. Output should like below. The real dataset has 10 million rows, so looking for a pythonic answer.
user_id event_name event_params
10 start /pseudo
10 subcategory /home
10 add_basket_error /event
10 add_basket_click /click
10 end /end
11 start /pseudo
11 add_basket_error /event
11 add_basket_click /click
Upvotes: 3
Views: 251
Reputation: 195
I did the following and it worked
df['scounter'] = df.groupby('user_id').cumcount()+1
#
df1 = df[df.event_name == 'Add_Basket_Error']
df = df[df.event_name != 'Add_Basket_Error']
#
df1['scounter'] = df1['scounter'] - 1.1
#
df = df.append(df1, ignore_index=True)
#
df.sort_values(['user_id', 'scounter'], ascending = [True, True], inplace=True)
df = df.reset_index(drop=True)
Upvotes: 0
Reputation: 862611
Here is possible create list of all possible values in event_name
by expected order, then convert column to ordered categoricals, so possible sorting by 2 columns with DataFrame.sort_values
:
L = ['start','subcategory','add_basket_error','add_basket_click','end']
df['event_name'] = pd.Categorical(df['event_name'], ordered=True, categories=L)
df = df.sort_values(['user_id','event_name'])
print (df)
user_id event_name event_params
0 10 start /pseudo
1 10 subcategory /home
3 10 add_basket_error /event
2 10 add_basket_click /click
4 10 end /end
5 11 start /pseudo
7 11 add_basket_error /event
6 11 add_basket_click /click
EDIT:
#added separate row 1 - not changed in output
print (df)
user_id event_name event_params
0 10 start /pseudo
1 10 add_basket_error /event
2 10 subcategory /home
3 10 add_basket_click /click
4 10 add_basket_error /event
5 10 end /end
6 11 start /pseudo
7 11 add_basket_click /click
8 11 add_basket_error /event
You can compare by Series.eq
and also Series.shift
ed values, last assign back swapped rows:
m11 = df['event_name'].eq('add_basket_click')
m12 = df['event_name'].shift(-1).eq('add_basket_error')
m21 = df['event_name'].eq('add_basket_error')
m22 = df['event_name'].shift().eq('add_basket_click')
df[m21 & m22], df[m11 & m12] = df[m11 & m12].values, df[m21 & m22].values
print (df)
user_id event_name event_params
0 10 start /pseudo
1 10 add_basket_error /event
2 10 subcategory /home
3 10 add_basket_error /event
4 10 add_basket_click /click
5 10 end /end
6 11 start /pseudo
7 11 add_basket_error /event
8 11 add_basket_click /click
Upvotes: 3
Reputation: 18647
Here is one potential solution, using boolean indexing
and loc
:
# Boolean series of event_name containing 'add_basket_error'
s = df.event_name.str.contains('add_basket_error')
# Create 2 frames, errors and events from boolean index 's'
errors, events = (df.loc[s[s].index].copy(), df.loc[s[s].index - 1].copy())
# Swap event and error values
df.loc[s[s].index] = events.values
df.loc[s[s].index - 1] = errors.values
print(df)
[output]
user_id event_name event_params
0 10 start /pseudo
1 10 subcategory /home
2 10 add_basket_error /event
3 10 add_basket_click /click
4 10 end /end
5 11 start /pseudo
6 11 add_basket_error /event
7 11 add_basket_click /click
Upvotes: 3