Shuqi

Reputation: 43

How to count unique values from two columns based on another column (per id)?

I have 6 million transaction records, so I need a fast solution. I have unique customer IDs together with the car class each customer reserved and the class they actually drove in the end. A customer may have one or more rental experiences. For a given customer at each time point, I want to count how many distinct car classes he/she has experienced so far, combining the reserved and driven classes.

In fact, my data is not even in this order: the ids and dates are unsorted. The layout below is just for readability. It would be great if your solution also handled the unsorted case!

Thank you!

The data looks like this:

id  date reserved drove
1   2017    A       B
1   2018    B       A
1   2019    A       C
2   2017    A       B
2   2018    C       D
3   2018    D       D

I want this result:

id  date  experience
1   2017     2 #(A+B)
1   2018     2 #still the same as 2017 because this customer just experienced A and B (A+B)
1   2019     3 #one more experience because C is new car class (A+B+C)
2   2017     2 #(A+B)
2   2018     4 #(A+B+C+D)
3   2018     1 #(D)
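To make the rule concrete, here is a plain-Python sketch of what is being asked: keep a running set of classes per id and record its size at each row (the column names follow the sample above; this is just a reference, not a required approach):

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 3],
     "date": [2017, 2018, 2019, 2017, 2018, 2018],
     "reserved": ["A", "B", "A", "A", "C", "D"],
     "drove": ["B", "A", "C", "B", "D", "D"]})

# unsorted input is fine: sort by id and date first
df = df.sort_values(["id", "date"], ignore_index=True)

seen = {}            # id -> set of car classes experienced so far
experience = []
for cust, res, drv in zip(df["id"], df["reserved"], df["drove"]):
    seen.setdefault(cust, set()).update((res, drv))
    experience.append(len(seen[cust]))
df["experience"] = experience
print(df[["id", "date", "experience"]])  # experience: 2, 2, 3, 2, 4, 1
```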

Upvotes: 3

Views: 100

Answers (3)

Yuca

Reputation: 6091

It can be done in two lines (and I'm pretty sure someone can pull it off in one): build a list of the values observed in both reserved and drove for each row, then count the distinct contents of the per-id running concatenation (using cumsum).

df['aux'] = list(map(list, zip(df.reserved, df.drove)))
df['aux_cum'] = [len(set(x)) for x in df.groupby('id')['aux'].apply(lambda x: x.cumsum())]

Output:

   id  date reserved drove     aux  aux_cum
0   1  2017        A     B  [A, B]        2
1   1  2018        B     A  [B, A]        2
2   1  2019        A     C  [A, C]        3
3   2  2017        A     B  [A, B]        2
4   2  2018        C     D  [C, D]        4
5   3  2018        D     D  [D, D]        1

Pretty format:

print(df.drop(['reserved','drove','aux'], axis=1))

   id  date  aux_cum
0   1  2017        2
1   1  2018        2
2   1  2019        3
3   2  2017        2
4   2  2018        4
5   3  2018        1
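For reference, a self-contained version of the above (the sample frame is rebuilt here, and the input is sorted by id and date first, since the question says the real data is unsorted):

```python
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 3],
     "date": [2017, 2018, 2019, 2017, 2018, 2018],
     "reserved": ["A", "B", "A", "A", "C", "D"],
     "drove": ["B", "A", "C", "B", "D", "D"]}
).sort_values(["id", "date"], ignore_index=True)

# pair up reserved/drove per row, then concatenate the pairs cumulatively per id
df["aux"] = list(map(list, zip(df.reserved, df.drove)))
df["aux_cum"] = [len(set(x))
                 for x in df.groupby("id")["aux"].apply(lambda s: s.cumsum())]
print(df.drop(["reserved", "drove", "aux"], axis=1))
```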

Upvotes: 1

yatu

Reputation: 88246

Here's a numpy based approach:

import numpy as np
# sort each row's (reserved, drove) pair so the smaller class comes first
df[['reserved','drove']] = np.sort(df[['reserved','drove']])
# then sort the rows by id, reserved and drove
df = df.sort_values(['id','reserved','drove'])

And now let's define some conditions with which to obtain the expected output:

# Does the id change from the previous row?
c1 = df.id.ne(df.id.shift()).values
# Does each column's value differ from the previous row's value in that column?
c2 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1))).values
# Within each row, does drove differ from reserved? (avoids double-counting pairs like D, D)
c3 = (df[['reserved','drove']].ne(df[['reserved','drove']].shift(1, axis=1))).values

df['experience'] = ((c2 + c1[:,None]) * c3).sum(1)
df = df[['id','date']].assign(experience = df.groupby('id').experience.cumsum())

print(df)

   id  date  experience
0   1  2017           2
1   1  2018           2
2   1  2019           3
3   2  2017           2
4   2  2018           4
5   3  2018           1
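For completeness, the same approach as a self-contained sketch on the question's sample (the frame is rebuilt here; note that on this data the class-sorted row order happens to coincide with the date order, which the final per-id cumsum relies on):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"id": [1, 1, 1, 2, 2, 3],
     "date": [2017, 2018, 2019, 2017, 2018, 2018],
     "reserved": ["A", "B", "A", "A", "C", "D"],
     "drove": ["B", "A", "C", "B", "D", "D"]})

# canonicalize each row's pair, then sort rows by id and classes
df[["reserved", "drove"]] = np.sort(df[["reserved", "drove"]])
df = df.sort_values(["id", "reserved", "drove"])

c1 = df.id.ne(df.id.shift()).values
c2 = df[["reserved", "drove"]].ne(df[["reserved", "drove"]].shift(1)).values
c3 = df[["reserved", "drove"]].ne(df[["reserved", "drove"]].shift(1, axis=1)).values

# count the new classes contributed by each row, then take a running total per id
df["experience"] = ((c2 + c1[:, None]) * c3).sum(1)
out = df[["id", "date"]].assign(experience=df.groupby("id").experience.cumsum())
print(out)
```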

Upvotes: 1

hchw

Reputation: 1416

How about plain Python? A pandas DataFrame isn't great at set operations, and this problem is ultimately about sets.

df = pd.DataFrame([
    [1, 2017, 'a', 'b'],
    [1, 2018, 'a', 'b'],
    [1, 2019, 'a', 'c'],
    [2, 2017, 'a', 'b'],
    [2, 2018, 'c', 'd'],
    [3, 2018, 'd', 'd'],
], columns=['id', 'date', 'reserved', 'drove'])

# build (id, date, {classes}) triples and sort by id and date (handles unsorted input;
# a sort key is needed because sets in the tuples are not orderable on ties)
list_of_sets = [(v[0], v[1], {v[2], v[3]}) for v in df.values]
sorted_list = sorted(list_of_sets, key=lambda t: (t[0], t[1]))

# keep a running set per id so experiences accumulate over *all* earlier rows,
# not just the immediately preceding one
rows, seen, prev_id = [], set(), None
for cust_id, date, classes in sorted_list:
    if cust_id != prev_id:
        seen, prev_id = set(), cust_id
    seen = seen | classes
    rows.append((cust_id, date, len(seen)))

result = pd.DataFrame(rows, columns=['id', 'date', 'count'])

Upvotes: 1
