Quickbeam2k1

Reputation: 5437

From transaction data to list of sets in an efficient way

I have a csv file with transaction data of the following form

import pandas as pd
df = pd.DataFrame({'OrderID':[1,1,1,1,2,2], 'ItemID':[1,2,3,4,1,2]})
print(df)
   ItemID  OrderID
0       1        1
1       2        1
2       3        1
3       4        1
4       1        2
5       2        2

I want to obtain a list that contains for every OrderID the sets of items.

This can be obtained with

df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
[{1, 2, 3, 4}, {1, 2}]

However, on a csv file with 9 million rows this takes some time, so I'm wondering whether there is a faster way. I'm interested in any solution, whether it uses pandas or operates directly on the .csv file.


First of all, I want to thank you guys for your awesome input! I took a sample of 50000 OrderIDs (and the corresponding items) from my real data and applied several of the methods from the answers to this data set. Here are the results:

[benchmark results image]

Note that I used the updated version of the pir program. So the winner is divakar, even if we only consider the list-of-sets output.

On my whole data set, his faster set-based approach takes 5.05 seconds and his even faster list-based approach only 2.32 s. That is a huge gain over the initial 115 seconds! Thanks again!
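For reference, a comparison like this can be scripted with the standard-library timeit module. A minimal sketch, assuming df holds the 50000-order sample and the candidate functions from the answers below (quik, jez, pir, divakar_v1 ... divakar_v4) are already defined:

import timeit

# Best of three single runs per candidate over the sample frame
for f in (quik, jez, pir, divakar_v1, divakar_v2, divakar_v3, divakar_v4):
    t = min(timeit.repeat(lambda: f(df), repeat=3, number=1))
    print('{:<12} {:.3f} s'.format(f.__name__, t))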

Upvotes: 4

Views: 960

Answers (3)

Divakar

Reputation: 221514

Approach #1 : Using array splitting and set -

import numpy as np

def divakar_v1(df):
    a = df.values
    sidx = a[:,1].argsort() # Use .argsort(kind='mergesort') to keep order
    # Positions where the sorted 'OrderID' column changes value
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))
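On the question's sample frame this produces the expected output (note that df.values puts 'ItemID' in column 0 and 'OrderID' in column 1, which is exactly what the a[:,1] indexing relies on):

print(divakar_v1(df))
[{1, 2, 3, 4}, {1, 2}]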

Approach #2 : Using iterative array slicing and set -

def divakar_v2(df):
    data = df.values
    a = data[data[:,1].argsort()] # Use .argsort(kind='mergesort') to keep order
    # Row indices where a new 'OrderID' group starts, plus the final row count
    stop = np.append(np.nonzero(a[1:,1] > a[:-1,1])[0]+1, len(a))
    start = np.append(0, stop[:-1])
    out_set = [set(a[start[i]:stop[i],0]) for i in range(len(start))]
    return out_set

Given that the 'ItemID' values are unique/distinct within each 'OrderID', we can skip the use of set() entirely, which gives us a list of lists as output. These two approaches are listed next.

Approach #3 : Using array splitting and list of lists as o/p -

def divakar_v3(df):
    a = df.values
    sidx = a[:,1].argsort() # Use .argsort(kind='mergesort') to keep order
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(list,out))

Approach #4 : Using iterative array slicing and list of lists as o/p -

def divakar_v4(df):
    data = df.values
    a = data[data[:,1].argsort()] # Use .argsort(kind='mergesort') to keep order
    stop = np.append(np.nonzero(a[1:,1] > a[:-1,1])[0]+1, len(a))
    start = np.append(0, stop[:-1])
    a0 = a[:,0].tolist()
    return [a0[start[i]:stop[i]] for i in range(len(start))]
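On the same sample frame the list-based variants return plain lists instead of sets (with the stable mergesort the within-order item order is also preserved):

print(divakar_v4(df))
[[1, 2, 3, 4], [1, 2]]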

Runtime test -

In [145]: np.random.seed(123)
     ...: N = 100000
     ...: df = pd.DataFrame(np.random.randint(30,size=(N,2)))
     ...: df.columns = ['ItemID','OrderID']
     ...: 

In [146]: %timeit divakar_v1(df)
     ...: %timeit divakar_v2(df)
     ...: %timeit divakar_v3(df)
     ...: %timeit divakar_v4(df)
     ...: 
10 loops, best of 3: 21.1 ms per loop
10 loops, best of 3: 21.7 ms per loop
100 loops, best of 3: 16.7 ms per loop
100 loops, best of 3: 12.3 ms per loop

Upvotes: 3

piRSquared

Reputation: 294218

new method: defaultdict

from collections import defaultdict

def pir(df):
    d = defaultdict(set)
    # groupby yields each OrderID exactly once; update() collects
    # that group's items into its set
    for n, g in df.groupby('OrderID').ItemID:
        d[n].update(g.values.tolist())

    return list(d.values())
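Since the question also asks for solutions that operate directly on the .csv file, the same defaultdict idea works with nothing but the standard library. A minimal sketch, assuming a file 'transactions.csv' (hypothetical name) with 'OrderID' and 'ItemID' header columns:

import csv
from collections import defaultdict

def sets_from_csv(path):
    # Stream the file once, collecting the item set of each order
    d = defaultdict(set)
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            d[row['OrderID']].add(int(row['ItemID']))  # ItemID assumed integer
    return list(d.values())

sets_from_csv('transactions.csv')  # e.g. [{1, 2, 3, 4}, {1, 2}]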

sample

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 1000, 10000000),
                       ItemID=np.random.randint(0, 1000, 10000000)))

[timing plot]


old method

uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)

def gu(i):
    # io == i selects the rows of the i-th unique OrderID;
    # ii maps those rows back into the unique-item array ui
    return set(ui[ii[io == i]].tolist())

[gu(i) for i in range(len(uo))]

[{1, 2, 3, 4}, {1, 2}]

old timing
code:

def pir(df):
    uo, io = np.unique(df.OrderID.values, return_inverse=True)
    ui, ii = np.unique(df.ItemID.values, return_inverse=True)

    def gu(i):
        return set(ui[ii[io == i]].tolist())

    return [gu(i) for i in range(len(uo))]

def jez(df):
    arr = df.groupby('OrderID')['ItemID'].unique().values
    return [set(v) for v in arr]

def div(df):
    a = df.values
    sidx = a[:,1].argsort(kind='mergesort')
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))

def quik(df):
    return df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()

with sample data
[timing plot]

with more data

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000),
                       ItemID=np.random.randint(0, 10, 10000)))

[timing plot]

even more data

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000000),
                       ItemID=np.random.randint(0, 10, 10000000)))

[timing plot]

Upvotes: 5

jezrael

Reputation: 862511

You can use SeriesGroupBy.unique, take the underlying numpy array, and finally convert each sub-array to a set with a list comprehension:

arr = df.groupby('OrderID')['ItemID'].unique().values
print (arr)
[array([1, 2, 3, 4], dtype=int64) array([1, 2], dtype=int64)]

print ([set(v) for v in arr])
[{1, 2, 3, 4}, {1, 2}]

EDIT: It is faster to call unique inside apply:

print (df.groupby('OrderID').apply(lambda x: set(x['ItemID'].unique())).tolist())

Timings:

np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30,size=(N,2)))
df.columns = ['ItemID','OrderID'] # column order matters: divakar's a[:,1] must be 'OrderID'

def pir(df):
    uo, io = np.unique(df.OrderID.values, return_inverse=True)
    ui, ii = np.unique(df.ItemID.values, return_inverse=True)
    def gu(i):
        return set(ui[ii[io == i]].tolist())
    return [gu(i) for i in range(len(uo))]

def divakar(df):
    a = df.values
    sidx = a[:,1].argsort(kind='mergesort')
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
    out = np.split(a[sidx,0], cut_idx)
    return list(map(set,out))

In [120]: %timeit (df.groupby('OrderID')
                     .apply(lambda x: set(x['ItemID'].unique())).tolist())
10 loops, best of 3: 92.7 ms per loop


In [121]: %timeit (df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist())
10 loops, best of 3: 168 ms per loop

In [122]: %timeit ([set(v) for v in df.groupby('OrderID')['ItemID'].unique().values])
10 loops, best of 3: 125 ms per loop

In [123]: %timeit (list(map(set,df.groupby('OrderID')['ItemID'].unique().values)))
10 loops, best of 3: 125 ms per loop

In [124]: %timeit (pir(df))
1 loop, best of 3: 276 ms per loop

In [125]: %timeit (divakar(df))
1 loop, best of 3: 190 ms per loop

Upvotes: 3
