Reputation: 7526
I have a pandas dataframe where I would like to compare the values in Var1 and Var2:
import pandas as pd
data = [
    ['foo', 'foo', 1613030200],
    ['foo', 'foo', 1613030300],
    ['foo', 'bar', 1613030400],
    ['foo', 'foo', 1613030500],
    ['foo', 'foo', 1613030600],
    ['bar', 'foo', 1613030700],
    ['foo', 'foo', 1613030800],
    ['foo', 'foo', 1613030900],
    ['foo', 'foo', 1613030985],
]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'ts'])
df
The idea is to add a separate column called group that will increment by 1 only when a change is detected between Var1 and Var2 over time (starting from the earliest timestamp).
The output would look something like this:
Var1 Var2 ts group
0 foo foo 1613030200 0
1 foo foo 1613030300 0
2 foo bar 1613030400 1
3 foo foo 1613030500 1
4 foo foo 1613030600 1
5 bar foo 1613030700 2
6 foo foo 1613030800 2
7 foo foo 1613030900 2
8 foo foo 1613030985 2
I have tried using a lambda function to increment, though this throws an error:
counter = 0
df[['Var1','Var2']].apply(lambda x: counter +=1 if x['Var1']!=x['Var2'] else counter, axis=1)
Is there an efficient way in Python to increment a counter across rows based on a condition that involves multiple columns (without a for loop)? In SQL, the equivalent could be done with a window function such as:
SUM(
CASE WHEN Var1 <> Var2 THEN 1 ELSE 0 END
) OVER (ORDER BY ts) AS group
Upvotes: 1
Views: 1421
Reputation: 150785
Looks like just a cumsum:
df['groups'] = df['Var1'].ne(df['Var2']).cumsum()
Output:
Var1 Var2 ts groups
0 foo foo 1613030200 0
1 foo foo 1613030300 0
2 foo bar 1613030400 1
3 foo foo 1613030500 1
4 foo foo 1613030600 1
5 bar foo 1613030700 2
6 foo foo 1613030800 2
7 foo foo 1613030900 2
8 foo foo 1613030985 2
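For anyone wondering why this works: ne gives a boolean Series that is True wherever Var1 differs from Var2, and cumsum treats each True as 1, so the running total goes up exactly on the rows that differ. The lambda in the question fails because counter += 1 is a statement, and a lambda body can only hold an expression. Below is a rough self-contained sketch of the same idea. The sort_values step and the intermediate name changed are my own additions, on the assumption that the rows may not already be ordered by ts (that is what ORDER BY ts does in the SQL version).

```python
import pandas as pd

data = [
    ['foo', 'foo', 1613030200],
    ['foo', 'foo', 1613030300],
    ['foo', 'bar', 1613030400],
    ['foo', 'foo', 1613030500],
    ['foo', 'foo', 1613030600],
    ['bar', 'foo', 1613030700],
    ['foo', 'foo', 1613030800],
    ['foo', 'foo', 1613030900],
    ['foo', 'foo', 1613030985],
]
df = pd.DataFrame(data, columns=['Var1', 'Var2', 'ts'])

# Assumption: order by timestamp first, mirroring ORDER BY ts in the SQL.
# The sample data is already sorted, so this is a no-op here.
df = df.sort_values('ts')

# True on every row where the two columns differ (hypothetical helper name).
changed = df['Var1'].ne(df['Var2'])

# Running count of the True values: increments only on differing rows.
df['group'] = changed.cumsum()

print(df)
```

If the data is guaranteed to be sorted already, the sort can be dropped and this reduces to the one-liner above.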
Upvotes: 5