Dan Scally

Reputation: 2042

Concatenate values from earlier rows in a pandas dataframe

I have a slightly odd pandas group by question.

I have a source dataframe with three columns: Customer, Date and Item. I want to add a new column, Item History, containing an array of all the Items for that Customer from earlier rows (earlier as defined by the Date). For example, given this source dataframe:

Customer    Date    Item
Bert    01/01/2019  Bread
Bert    15/01/2019  Cheese
Bert    20/01/2019  Apples
Bert    22/01/2019  Pears
Ernie   01/01/2019  Buzz Lightyear
Ernie   15/01/2019  Shellfish
Ernie   20/01/2019  A pet dog
Ernie   22/01/2019  Yoghurt
Steven  01/01/2019  A golden toilet
Steven  15/01/2019  Dominoes
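
For reference, this source dataframe can be built with something like the following (dates left as strings for brevity):

import pandas as pd

df = pd.DataFrame({
    'Customer': ['Bert'] * 4 + ['Ernie'] * 4 + ['Steven'] * 2,
    'Date': ['01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019'] * 2
            + ['01/01/2019', '15/01/2019'],
    'Item': ['Bread', 'Cheese', 'Apples', 'Pears',
             'Buzz Lightyear', 'Shellfish', 'A pet dog', 'Yoghurt',
             'A golden toilet', 'Dominoes'],
})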

I want to create this history feature:

Customer    Date    Item    Item History
Bert    01/01/2019  Bread   NaN
Bert    15/01/2019  Cheese  [Bread]
Bert    20/01/2019  Apples  [Bread, Cheese]
Bert    22/01/2019  Pears   [Bread, Cheese, Apples]
Ernie   01/01/2019  Buzz Lightyear  NaN
Ernie   15/01/2019  Shellfish   [Buzz Lightyear]
Ernie   20/01/2019  A pet dog   [Buzz Lightyear, Shellfish]
Ernie   22/01/2019  Yoghurt [Buzz Lightyear, Shellfish, A pet dog]
Steven  01/01/2019  A golden toilet NaN
Steven  15/01/2019  Dominoes    [A golden toilet]

I can do the following to get the History by date:

df.groupby(['Customer', 'Date']).agg(lambda x: tuple(x)).applymap(list).reset_index()

That way, where a customer purchased multiple items on a single day they're all listed together in one array, and where a customer purchased just one item it sits alone in its own array. What I can't figure out is how to concatenate those arrays with the ones from earlier rows.
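
For instance, on the sample data above (where nobody buys more than one item per day) that gives one single-element list per row, roughly like this:

grouped = df.groupby(['Customer', 'Date']).agg(lambda x: tuple(x)).applymap(list).reset_index()
# grouped.Item is now [Bread], [Cheese], [Apples], ... - one list per
# Customer/Date pair, but each list only covers that single day.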

Upvotes: 2

Views: 491

Answers (2)

Dan Scally

Reputation: 2042

I used @jezrael's answer for a significant amount of time, but with the dataset sizes I was working with it eventually became much too slow. To speed things up, I wrote a function that does the same thing (it assumes the frame is already sorted by customer and date, so each customer's rows are contiguous and in order):

def buildItemHistoryPy(customers, items):
    # customers and items are parallel arrays; customer_ix marks the row at
    # which the current customer's purchases start, so items[customer_ix:i]
    # is everything that customer bought before row i.
    output = []
    customer_ix = 0

    for i in range(len(customers)):
        if customers[i] == customers[i-1]:
            output.append(items[customer_ix:i])
        else:
            # a new customer starts at row i, so the history resets here
            customer_ix = i
            output.append(items[customer_ix:i])

    return output

df['Item History'] = buildItemHistoryPy(df.CustomerAccountNum.values, df.ItemID.values)

My intent was to use that as the basis of a Cython function (which I expected to be much faster), but to my surprise the bare Python function is already significantly faster in its own right. I went ahead and Cythonised it anyway:

%%cython
import numpy as np
cimport numpy as np

cpdef list buildItemHistoryCy(np.ndarray customers, np.ndarray items):
    # Same logic as buildItemHistoryPy above, just with typed declarations.

    cdef list output = []
    cdef int customer_ix = 0

    for i in range(len(customers)):
        if customers[i] == customers[i-1]:
            output.append(items[customer_ix:i])
        else:
            customer_ix = i
            output.append(items[customer_ix:i])

    return output

The outcome is that either function is far faster than the original approach, with the Cython version ahead by a moderate amount:

%timeit -n5 df['Item History1'] = [x.ItemID[:i].tolist() for j, x in df.groupby('CustomerAccountNum') for i in range(len(x))]
%timeit -n5 df['Item History2'] = buildItemHistoryPy(df.CustomerAccountNum.values, df.ItemID.values)
%timeit -n5 df['Item History3'] = buildItemHistoryCy(df.CustomerAccountNum.values, df.ItemID.values)

7.46 s ± 346 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
53.5 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
23.6 ms ± 2.53 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

My requirements have changed slightly, so replacing the empty lists with NaN is no longer necessary. If it were, the functions would need to append items[customer_ix:i].tolist() instead, i.e. plain lists rather than array slices, so that the empty entries could be picked out and replaced.
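
A minimal sketch of that variant would look something like this (buildItemHistoryPyLists is just an illustrative name; the empty-list replacement is the same trick as in @jezrael's answer, and the same sorted-by-customer-and-date assumption applies):

import numpy as np

def buildItemHistoryPyLists(customers, items):
    # Same walk as above, but append plain Python lists rather than array
    # slices so the empty ones can be swapped for NaN afterwards.
    output = []
    customer_ix = 0

    for i in range(len(customers)):
        if customers[i] != customers[i-1]:
            customer_ix = i
        output.append(items[customer_ix:i].tolist())

    return output

df['Item History'] = buildItemHistoryPyLists(df.CustomerAccountNum.values, df.ItemID.values)
df.loc[~df['Item History'].astype(bool), 'Item History'] = np.nan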

Upvotes: 0

jezrael

Reputation: 862521

Use a custom lambda function with GroupBy.transform, then replace empty lists with NaN:

f = lambda x: [x[:i].tolist() for i in range(len(x))]
df['Item History'] = df.groupby('Customer')['Item'].transform(f)

Another solution with a list comprehension:

df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer') 
                                          for i in range(len(x))]

import numpy as np

df.loc[~df['Item History'].astype(bool), 'Item History'] = np.nan

print (df)
  Customer        Date             Item  \
0     Bert  01/01/2019            Bread   
1     Bert  15/01/2019           Cheese   
2     Bert  20/01/2019           Apples   
3     Bert  22/01/2019            Pears   
4    Ernie  01/01/2019   Buzz Lightyear   
5    Ernie  15/01/2019        Shellfish   
6    Ernie  20/01/2019        A pet dog   
7    Ernie  22/01/2019          Yoghurt   
8   Steven  01/01/2019  A golden toilet   
9   Steven  15/01/2019         Dominoes   

                             Item History  
0                                     NaN  
1                                 [Bread]  
2                         [Bread, Cheese]  
3                 [Bread, Cheese, Apples]  
4                                     NaN  
5                        [Buzz Lightyear]  
6             [Buzz Lightyear, Shellfish]  
7  [Buzz Lightyear, Shellfish, A pet dog]  
8                                     NaN  
9                       [A golden toilet]  

Upvotes: 4
