Reputation: 5437
I have a csv file with transaction data of the following form
import pandas as pd
df = pd.DataFrame({'OrderID':[1,1,1,1,2,2], 'ItemID':[1,2,3,4,1,2]})
print(df)
   ItemID  OrderID
0       1        1
1       2        1
2       3        1
3       4        1
4       1        2
5       2        2
I want to obtain a list that contains for every OrderID the sets of items.
This can be obtained with
df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
[{1, 2, 3, 4}, {1, 2}]
However, on a csv file with 9 million rows this takes some time, so I'm wondering whether there is a faster way. I'm interested in any solution, whether it uses pandas or operates directly on the .csv file.
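For reference, here is a minimal pure-stdlib sketch of the "directly on the .csv file" route, streaming the file once into a dict of sets; the file name and the header column names are assumptions about the actual data.
import csv
from collections import defaultdict

def sets_from_csv(path):
    # One pass over the file: collect the ItemIDs of each OrderID in a set.
    # Values stay strings here; wrap them in int() if numeric IDs are needed.
    d = defaultdict(set)
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            d[row['OrderID']].add(row['ItemID'])
    return list(d.values())

item_sets = sets_from_csv('transactions.csv')  # hypothetical file name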
First of all I want to thank you guys for your awesome input! I took a sample of 50,000 OrderIDs (and the corresponding items) from my real data and applied several of the methods from the answers to that data set. Note that I used the updated version of the pir program. So the winner is divakar, even if we only consider the list-of-sets output.
On my whole data set, his faster set-based approach takes 5.05 s and his faster list-based approach only 2.32 s. That is a huge gain from the initial 115 seconds! Thanks again!
Upvotes: 4
Views: 960
Reputation: 221514
Approach #1 : Using array splitting and set -
import numpy as np

def divakar_v1(df):
    a = df.values
    sidx = a[:,1].argsort() # Use .argsort(kind='mergesort') to keep order
    cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1 # positions where a new OrderID starts
    out = np.split(a[sidx,0], cut_idx) # split the sorted ItemIDs at those positions
    return list(map(set,out))
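As a quick sanity check (not part of the original answer), this reproduces the question's expected output; the column order is pinned explicitly because the function reads the OrderID from column 1.
import pandas as pd

# The question's sample frame, with OrderID pinned to column 1 as
# divakar_v1 expects when it indexes a[:,1].
df = pd.DataFrame({'ItemID': [1, 2, 3, 4, 1, 2],
                   'OrderID': [1, 1, 1, 1, 2, 2]},
                  columns=['ItemID', 'OrderID'])
print(divakar_v1(df))  # [{1, 2, 3, 4}, {1, 2}]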
Approach #2 : Using iterative array slicing and set -
def divakar_v2(df):
    data = df.values
    a = data[data[:,1].argsort()] # Use .argsort(kind='mergesort') to keep order
    stop = np.append(np.nonzero(a[1:,1] > a[:-1,1])[0]+1, len(a)) # group end offsets; len(a), not a.size, since a is 2-D
    start = np.append(0, stop[:-1]) # group start offsets
    out_set = [set(a[start[i]:stop[i],0]) for i in range(len(start))]
    return out_set
Given that, per 'OrderID', the elements in 'ItemID' are unique/distinct, we have two more approaches that skip the use of set() and thus give us a list of lists as output. These are listed next; a quick way to verify that uniqueness assumption is sketched right below.
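If you are not sure the assumption holds for your data, a cheap check along these lines settles it; the helper name is mine and the column names are taken from the question.
import pandas as pd

def items_unique_per_order(df):
    # True iff no (OrderID, ItemID) pair occurs twice, i.e. the list-based
    # approaches below return exactly what the set-based ones would.
    return not df.duplicated(subset=['OrderID', 'ItemID']).any()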
Approach #3 : Using array splitting and list of lists as o/p -
def divakar_v3(df):
a = df.values
sidx = a[:,1].argsort() # Use .argsort(kind='mergesort') to keep order
cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
out = np.split(a[sidx,0], cut_idx)
return list(map(list,out))
Approach #4 : Using iterative array slicing and list of lists as o/p -
def divakar_v4(df):
    data = df.values
    a = data[data[:,1].argsort()] # Use .argsort(kind='mergesort') to keep order
    stop = np.append(np.nonzero(a[1:,1] > a[:-1,1])[0]+1, len(a)) # len(a), not a.size, since a is 2-D
    start = np.append(0, stop[:-1])
    a0 = a[:,0].tolist()
    return [a0[start[i]:stop[i]] for i in range(len(start))]
Runtime test -
In [145]: np.random.seed(123)
...: N = 100000
...: df = pd.DataFrame(np.random.randint(30,size=(N,2)))
...: df.columns = ['ItemID','OrderID']
...:
In [146]: %timeit divakar_v1(df)
...: %timeit divakar_v2(df)
...: %timeit divakar_v3(df)
...: %timeit divakar_v4(df)
...:
10 loops, best of 3: 21.1 ms per loop
10 loops, best of 3: 21.7 ms per loop
100 loops, best of 3: 16.7 ms per loop
100 loops, best of 3: 12.3 ms per loop
Upvotes: 3
Reputation: 294218
new method: defaultdict
from collections import defaultdict
def pir(df):
d = defaultdict(set)
for n, g in df.groupby('OrderID').ItemID:
d[n].update(g.values.tolist())
return list(d.values())
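For reference (my addition), on the question's sample frame this returns the expected groups; the result order follows the sorted OrderID keys that groupby iterates over.
import pandas as pd

# The question's sample frame; pir() groups by column name, so column
# order does not matter here.
df = pd.DataFrame({'OrderID': [1, 1, 1, 1, 2, 2],
                   'ItemID': [1, 2, 3, 4, 1, 2]})
print(pir(df))  # [{1, 2, 3, 4}, {1, 2}]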
sample
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(OrderID=np.random.randint(0, 1000, 10000000),
                       ItemID=np.random.randint(0, 1000, 10000000)))
old method
uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)
def gu(i):
return set(ui[ii[io == i]].tolist())
[gu(i) for i in range(len(uo))]
[{1, 2, 3, 4}, {1, 2}]
old timing
code:
def pir(df):
uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)
def gu(i):
return set(ui[ii[io == i]].tolist())
return [gu(i) for i in range(len(uo))]
def jez(df):
arr = df.groupby('OrderID')['ItemID'].unique().values
return [set(v) for v in arr]
def div(df):
a = df.values
sidx = a[:,1].argsort(kind='mergesort')
cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
out = np.split(a[sidx,0], cut_idx)
return list(map(set,out))
def quik(df):
return df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
with more data
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000),
                       ItemID=np.random.randint(0, 10, 10000)))
even more data
df = pd.DataFrame(dict(OrderID=np.random.randint(0, 10, 10000000),
                       ItemID=np.random.randint(0, 10, 10000000)))
Upvotes: 5
Reputation: 862511
You can try SeriesGroupBy.unique, then convert to a numpy array and finally to sets with a list comprehension:
arr = df.groupby('OrderID')['ItemID'].unique().values
print (arr)
[array([1, 2, 3, 4], dtype=int64) array([1, 2], dtype=int64)]
print ([set(v) for v in arr])
[{1, 2, 3, 4}, {1, 2}]
EDIT: It is faster to use unique inside apply:
print (df.groupby('OrderID').apply(lambda x: set(x['ItemID'].unique())).tolist())
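As a small equivalence check (my addition), both variants must agree, since taking the unique values first cannot change the resulting set; df is assumed to be the question's sample frame.
a = df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist()
b = df.groupby('OrderID').apply(lambda x: set(x['ItemID'].unique())).tolist()
assert a == b  # identical list of sets either way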
Timings:
np.random.seed(123)
N = 1000000
df = pd.DataFrame(np.random.randint(30,size=(N,2)))
df.columns = ['ItemID','OrderID'] # OrderID must be column 1: divakar() indexes it positionally via a[:,1]
def pir(df):
uo, io = np.unique(df.OrderID.values, return_inverse=True)
ui, ii = np.unique(df.ItemID.values, return_inverse=True)
def gu(i):
return set(ui[ii[io == i]].tolist())
return [gu(i) for i in range(len(uo))]
def divakar(df):
a = df.values
sidx = a[:,1].argsort(kind='mergesort')
cut_idx = np.nonzero(a[sidx[1:],1] > a[sidx[:-1],1])[0]+1
out = np.split(a[sidx,0], cut_idx)
return list(map(set,out))
In [120]: %timeit (df.groupby('OrderID').apply(lambda x: set(x['ItemID'].unique())).tolist())
10 loops, best of 3: 92.7 ms per loop
In [121]: %timeit (df.groupby('OrderID').apply(lambda x: set(x['ItemID'])).tolist())
10 loops, best of 3: 168 ms per loop
In [122]: %timeit ([set(v) for v in df.groupby('OrderID')['ItemID'].unique().values])
10 loops, best of 3: 125 ms per loop
In [123]: %timeit (list(map(set,df.groupby('OrderID')['ItemID'].unique().values)))
10 loops, best of 3: 125 ms per loop
In [124]: %timeit (pir(df))
1 loop, best of 3: 276 ms per loop
In [125]: %timeit (divakar(df))
1 loop, best of 3: 190 ms per loop
Upvotes: 3