Reputation: 2042
I have a slightly odd pandas group by question.
I have a source dataframe, which has three columns: Customer, Date and Item. I want to add a new column that contains Item History, being an array of all the Items for that Customer that are in earlier (defined by the Date) rows. For example given this source dataframe:
Customer Date Item
Bert 01/01/2019 Bread
Bert 15/01/2019 Cheese
Bert 20/01/2019 Apples
Bert 22/01/2019 Pears
Ernie 01/01/2019 Buzz Lightyear
Ernie 15/01/2019 Shellfish
Ernie 20/01/2019 A pet dog
Ernie 22/01/2019 Yoghurt
Steven 01/01/2019 A golden toilet
Steven 15/01/2019 Dominoes
I want to create this history feature:
Customer Date Item Item History
Bert 01/01/2019 Bread NaN
Bert 15/01/2019 Cheese [Bread]
Bert 20/01/2019 Apples [Bread, Cheese]
Bert 22/01/2019 Pears [Bread, Cheese, Apples]
Ernie 01/01/2019 Buzz Lightyear NaN
Ernie 15/01/2019 Shellfish [Buzz Lightyear]
Ernie 20/01/2019 A pet dog [Buzz Lightyear, Shellfish]
Ernie 22/01/2019 Yoghurt [Buzz Lightyear, Shellfish, A pet dog]
Steven 01/01/2019 A golden toilet NaN
Steven 15/01/2019 Dominoes [A golden toilet]
I can do the following to get the History by date:
df.groupby(['Customer', 'Date']).agg(lambda x: tuple(x)).applymap(list).reset_index()
So that where a customer purchased multiple items on a single day, they're all listed in an array and where a customer purchased just one item that's alone in its own array, but I can't figure out how to concatenate them with the earlier rows.
Upvotes: 2
Views: 491
Reputation: 2042
I used @jezrael's answer for a significant amount of time, but with the dataset sizes that I had it was in the end much too slow. To improve that, I created a function that performs the same thing:
def buildItemHistoryPy(customers, items):
output = []
customer_ix = 0
for i in range(len(customers)):
if customers[i] == customers[i-1]:
output.append(items[customer_ix:i])
else:
customer_ix = i
output.append(items[customer_ix:i])
return output
df['Item History'] = buildItemHistoryPy(df.CustomerAccountNum.values, df.ItemId.values)
My intent was to use that as the basis of a Cython function (which I expected to be much faster), but to my surprise the bare python function is significantly faster in its own right. I went ahead and Cythonised it anyway:
%%cython
import numpy as np
cimport numpy as np
cpdef list buildItemHistoryCy(np.ndarray customers, np.ndarray items):
cdef list output = []
cdef int customer_ix = 0
for i in range(len(customers)):
if customers[i] == customers[i-1]:
output.append(items[customer_ix:i])
else:
customer_ix = i
output.append(items[customer_ix:i])
return output
And the outcome is essentially that either function is faster, but Cython is the best by a moderate amount:
%timeit -n5 df['Item History1'] = [x.ItemID[:i].tolist() for j, x in df.groupby('CustomerAccountNum') for i in range(len(x))]
%timeit -n5 df['Item History2'] = buildItemHistoryPy(df.CustomerAccountNum.values, df.ItemID.values)
%timeit -n5 df['Item History3'] = buildItemHistoryCy(df.CustomerAccountNum.values, df.ItemID.values)
7.46 s ± 346 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
53.5 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
23.6 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
My requirements have changed slightly such that the nulling of empty lists is no longer necessary. If it were, the functions would have to change such that you're appending items[customer_ix:i].tolist()
instead.
Upvotes: 0
Reputation: 862521
Use custom lambda function with GroupBy.transform
, last replace empty lists to NaN
s:
f = lambda x: [x[:i].tolist() for i in range(len(x))]
df['Item History'] = df.groupby('Customer')['Item'].transform(f)
Another solution with list comprehension:
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
for i in range(len(x))]
df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
print (df)
Customer Date Item \
0 Bert 01/01/2019 Bread
1 Bert 15/01/2019 Cheese
2 Bert 20/01/2019 Apples
3 Bert 22/01/2019 Pears
4 Ernie 01/01/2019 Buzz Lightyear
5 Ernie 15/01/2019 Shellfish
6 Ernie 20/01/2019 A pet dog
7 Ernie 22/01/2019 Yoghurt
8 Steven 01/01/2019 A golden toilet
9 Steven 15/01/2019 Dominoes
Item History
0 NaN
1 [Bread]
2 [Bread, Cheese]
3 [Bread, Cheese, Apples]
4 NaN
5 [Buzz Lightyear]
6 [Buzz Lightyear, Shellfish]
7 [Buzz Lightyear, Shellfish, A pet dog]
8 NaN
9 [A golden toilet]
Upvotes: 4