Reputation: 43
I have 6 million transaction records, so I need this to run fast. Basically, I have unique customer IDs plus the car class each customer reserved and the one they actually drove in the end. A customer may have one or more rental experiences. For a given customer at each point in time, I want to count how many distinct car classes he/she has experienced so far, combining the unique classes from both reserved and drove.
In fact, my data is not even in this order: the ids and dates are unsorted. The layout shown below is just for readability. It would be nice if you could also handle the unsorted case!
Thank you!
The data looks like this:
id date reserved drove
1 2017 A B
1 2018 B A
1 2019 A C
2 2017 A B
2 2018 C D
3 2018 D D
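For reference, here is one way to build the sample above as a DataFrame (column names taken from the layout):

```python
import pandas as pd

# sample data from the table above
df = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 3],
    "date":     [2017, 2018, 2019, 2017, 2018, 2018],
    "reserved": ["A", "B", "A", "A", "C", "D"],
    "drove":    ["B", "A", "C", "B", "D", "D"],
})
```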
I want this result:
id date experience
1 2017 2 #(A+B)
1 2018 2 #still the same as 2017, because this customer has only experienced A and B so far (A+B)
1 2019 3 #one more experience because C is new car class (A+B+C)
2 2017 2 #(A+B)
2 2018 4 #(A+B+C+D)
3 2018 1 #(D)
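In other words, the experience column is just the running size of the union of all classes a customer has seen so far; a quick sketch for customer 1:

```python
# running union of (reserved, drove) classes for customer 1
seen, sizes = set(), []
for pair in [{"A", "B"}, {"B", "A"}, {"A", "C"}]:  # 2017, 2018, 2019
    seen |= pair
    sizes.append(len(seen))
print(sizes)  # [2, 2, 3] -- matches the experience column above
```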
Upvotes: 3
Views: 100
Reputation: 6091
It can be done in two lines (and I'm pretty sure someone can pull it off in one):
Build, for each row, the list of both the reserved and drove values, then count the distinct entries in the per-id running concatenation of those lists (via cumsum):
df = df.sort_values(['id', 'date'])  # handles the unsorted case first
df['aux'] = list(map(list, zip(df.reserved, df.drove)))
df['aux_cum'] = [len(set(x)) for x in df.groupby('id')['aux'].apply(lambda x: x.cumsum())]
Output:
id date reserved drove aux aux_cum
0 1 2017 A B [A, B] 2
1 1 2018 B A [B, A] 2
2 1 2019 A C [A, C] 3
3 2 2017 A B [A, B] 2
4 2 2018 C D [C, D] 4
5 3 2018 D D [D, D] 1
Pretty format:
print(df.drop(['reserved','drove','aux'], axis=1))
id date aux_cum
0 1 2017 2
1 1 2018 2
2 1 2019 3
3 2 2017 2
4 2 2018 4
5 3 2018 1
Upvotes: 1
Reputation: 88246
Here's a NumPy-based approach:
import numpy as np
# sort the two class columns within each row, so e.g. (B, A) becomes (A, B)
df[['reserved','drove']] = np.sort(df[['reserved','drove']])
# sort rows by id and the ordered class pair, so duplicate pairs end up adjacent
df = df.sort_values(['id','reserved','drove'])
And now let's define some conditions with which to obtain the expected output:
# Does the id change?
c1 = df.id.ne(df.id.shift()).values
# does each column's value differ from the previous row's? (checked per column)
c2 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1))).values
# within each row, does 'drove' differ from 'reserved'? (avoids counting a D, D row twice)
c3 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1, axis=1))).values
df['experience'] = ((c2 + c1[:,None]) * c3).sum(1)
df = df[['id','date']].assign(experience = df.groupby('id').experience.cumsum())
print(df)
id date experience
0 1 2017 2
1 1 2018 2
2 1 2019 3
3 2 2017 2
4 2 2018 4
5 3 2018 1
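Since the question also asks about unsorted input, here is a quick self-check of the pipeline above on a shuffled copy of the sample (a sketch; it rebuilds the data inline):

```python
import numpy as np
import pandas as pd

# sample from the question, shuffled to mimic unsorted input
df = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 3],
    "date":     [2017, 2018, 2019, 2017, 2018, 2018],
    "reserved": ["A", "B", "A", "A", "C", "D"],
    "drove":    ["B", "A", "C", "B", "D", "D"],
}).sample(frac=1, random_state=0)

# same pipeline as above
df[["reserved", "drove"]] = np.sort(df[["reserved", "drove"]])
df = df.sort_values(["id", "reserved", "drove"])
c1 = df.id.ne(df.id.shift()).values
c2 = df[["reserved", "drove"]].ne(df[["reserved", "drove"]].shift(1)).values
c3 = df[["reserved", "drove"]].ne(df[["reserved", "drove"]].shift(1, axis=1)).values
df["experience"] = ((c2 + c1[:, None]) * c3).sum(1)
df = df[["id", "date"]].assign(experience=df.groupby("id").experience.cumsum())
print(df.sort_values(["id", "date"]).experience.tolist())  # [2, 2, 3, 2, 4, 1]
```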
Upvotes: 1
Reputation: 1416
How about this? It uses a list comprehension, since a pandas DataFrame isn't great for dealing with sets (which is what this problem ultimately is about).
df = pd.DataFrame([
[1, 2017, 'a', 'b'],
[1, 2018, 'a', 'b'],
[1, 2019, 'a', 'c'],
[2, 2017, 'a', 'b'],
[2, 2018, 'c', 'd'],
[3, 2018, 'd', 'd'],
], columns=['id', 'date', 'reserved', 'drove'])
list_of_sets = [(v[0], v[1], {v[2], v[3]}) for v in df.values]
sorted_list = sorted(list_of_sets)  # not necessary if the data is sorted beforehand
from itertools import accumulate
# carry the whole running set forward while the id stays the same
# (comparing only the adjacent previous row would re-count a class
# seen two or more rows back, and i-1 wraps around at i=0)
cum = accumulate(sorted_list,
                 lambda prev, cur: (cur[0], cur[1], prev[2] | cur[2])
                 if cur[0] == prev[0] else cur)
result = pd.DataFrame([(cid, date, len(classes)) for cid, date, classes in cum],
                      columns=['id', 'date', 'count'])
Upvotes: 1