Reputation: 273
I am trying to look at a specific feature over time for different unique ids, stored in a pandas DataFrame.
Here is an example with code to replicate:
import pandas as pd

d = {'id': ['adam', 'john'], 't1': ['A', 'A'], 't2': ['A', 'B'], 't3': ['A', 'B'], 't4': ['B', 'A']}
df = pd.DataFrame(data=d)
df
id t1 t2 t3 t4
0 adam A A A B
1 john A B B A
I would like to count the number of times the t* value changes for a specific id. Example:
John's value starts at A, moves to B in t2 (one change), stays at B in t3 (no change), then moves back to A in t4 (a second change), so there were two changes in total.
Expected output is as follows:
id t1 t2 t3 t4 total_change
0 adam A A A B 1
1 john A B B A 2
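The counting rule described above can be sketched in plain Python, independent of pandas (the helper name `count_changes` is hypothetical):

```python
def count_changes(values):
    # Pair each value with its successor and count the mismatches.
    return sum(a != b for a, b in zip(values, values[1:]))

print(count_changes(['A', 'A', 'A', 'B']))  # adam -> 1
print(count_changes(['A', 'B', 'B', 'A']))  # john -> 2
```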
Upvotes: 2
Views: 84
Reputation: 51185
s = df[df.columns[1:]]
df.assign(total_change=s.ne(s.shift(axis=1).bfill(axis=1)).sum(axis=1))
Output:
id t1 t2 t3 t4 total_change
0 adam A A A B 1
1 john A B B A 2
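A small sketch of what the shift-and-compare does step by step, assuming the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'id': ['adam', 'john'],
                   't1': ['A', 'A'], 't2': ['A', 'B'],
                   't3': ['A', 'B'], 't4': ['B', 'A']})
s = df[df.columns[1:]]

# Shift each row one column to the right, then backfill along the row so
# t1 is compared against itself (and never counts as a change).
shifted = s.shift(axis=1).bfill(axis=1)

# True wherever a value differs from its predecessor; row sums are the counts.
changes = s.ne(shifted).sum(axis=1)
print(changes.tolist())  # [1, 2]
```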
This will be slower than its NumPy equivalent:
df = pd.concat([df]*10000)
s = df[df.columns[1:]]
v = df.filter(regex=r'^t\d+').values
%%timeit
df.assign(total_change=s.ne(s.shift(axis=1).bfill(axis=1)).sum(axis=1))
21.2 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df.assign(total_change=(v[:, 1:] != v[:, :-1]).sum(1))
1.9 ms ± 8.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Upvotes: 5
Reputation: 294516
Same idea, but using NumPy:
v = df.filter(regex=r'^t\d+').values
df.assign(total_change=(v[:, 1:] != v[:, :-1]).sum(1))
id t1 t2 t3 t4 total_change
0 adam A A A B 1
1 john A B B A 2
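The NumPy version works on the raw values array: slicing off the first and last columns produces two aligned views, and an elementwise comparison marks the changes. A sketch using the question's example frame:

```python
import pandas as pd

df = pd.DataFrame({'id': ['adam', 'john'],
                   't1': ['A', 'A'], 't2': ['A', 'B'],
                   't3': ['A', 'B'], 't4': ['B', 'A']})

# Grab only the t* columns as a 2-D array.
v = df.filter(regex=r'^t\d+').values

# Compare each column with the one before it; True marks a change.
diff = v[:, 1:] != v[:, :-1]
print(diff.sum(axis=1).tolist())  # [1, 2]
```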
Upvotes: 4
Reputation: 6043
You can filter the t* columns and iterate over each row's values, checking for a change:
columns_needed = [col for col in df.columns.values if col.startswith('t')]
df['total_change'] = df[columns_needed].apply(
    lambda row: sum(1 for val, val2 in zip(row, row[1:]) if val != val2),
    axis=1)
which results in:
id t1 t2 t3 t4 total_change
0 adam A A A B 1
1 john A B B A 2
The lambda expression in apply is equivalent to:
def check_change(row):
    # 1 for every pair of neighbouring values that differ
    is_ne_next_val = [1 for val, val2 in zip(row, row[1:]) if val != val2]
    return sum(is_ne_next_val)
Upvotes: 1
Reputation: 153510
You can use:
df.merge((df.set_index('id').shift(1, axis=1).bfill(axis=1) != df.set_index('id')).sum(axis=1)
         .rename('total_change')
         .to_frame(),
         left_on='id',
         right_index=True)
Output:
id t1 t2 t3 t4 total_change
0 adam A A A B 1
1 john A B B A 2
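Broken into steps, this builds a per-id Series of counts and merges it back onto the original frame by id (a sketch using the question's example data):

```python
import pandas as pd

df = pd.DataFrame({'id': ['adam', 'john'],
                   't1': ['A', 'A'], 't2': ['A', 'B'],
                   't3': ['A', 'B'], 't4': ['B', 'A']})

# With 'id' as the index, shift-and-compare yields one count per id...
counts = (df.set_index('id').shift(1, axis=1).bfill(axis=1)
          != df.set_index('id')).sum(axis=1).rename('total_change')

# ...which merges back onto the original frame via its 'id' column.
out = df.merge(counts.to_frame(), left_on='id', right_index=True)
print(out['total_change'].tolist())  # [1, 2]
```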
Upvotes: 3