Reputation: 11192
I have a pandas DataFrame like the one below:
col1 col2
0 12 1
1 1 7
2 54 17
3 11 191
4 3 39
5 76 2
6 18 6
Code to generate the df:
df=pd.DataFrame({'col1':[12,1,54,11,3,76,18],'col2':[1,7,17,191,39,2,6]})
I would like to compare each value in col1 against the complete col2 series, i.e. compare 12 with col2, count how many values in col2 are less than 12, then do the same for 1, then 54, and so on, and store the results in another series.
So far I tried like below,
df['res']=df.apply(lambda x:len(df[df['col2']<x['col1']]),axis=1)
It works as I expected, but it is an inefficient way to solve this and becomes very slow when the series is large.
I need an efficient way to solve this problem, because the actual dataset contains more than a million records.
Expected Output:
col1 col2 res
0 12 1 4
1 1 7 0
2 54 17 6
3 11 191 4
4 3 39 2
5 76 2 6
6 18 6 5
Upvotes: 2
Views: 1393
Reputation: 92854
Alternative solution with the np.less logic function:
In [119]: vals = df['col2'].values
In [120]: df['res'] = df.apply(lambda x: np.less(vals, x['col1']).sum(), axis=1)
In [121]: df
Out[121]:
col1 col2 res
0 12 1 4
1 1 7 0
2 54 17 6
3 11 191 4
4 3 39 2
5 76 2 6
6 18 6 5
Performance comparison:
In [122]: %timeit df['res'] = df.apply(lambda x: np.less(vals, x['col1']).sum(), axis=1)
2.09 ms ± 308 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [123]: %timeit df['res']=df.apply(lambda x:len(df[df['col2']<x['col1']]),axis=1)
8.57 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [124]: %timeit df['res'] = (df['col2'].values.reshape(1,-1) < df['col1'].values.reshape(-1,1)).sum(axis=1)
420 µs ± 26.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.less.html#numpy.less
Upvotes: 1
Reputation: 1879
The following uses numpy (by implicitly extending the vectors to a matrix using broadcasting) and works much faster than your proposed answer:
df['res'] = (df['col2'].values.reshape(1,-1) < df['col1'].values.reshape(-1,1)).sum(axis=1)
(in a test df with 10k rows, on my machine it takes 0.3s instead of 8s). However, it uses quadratic memory in the number of rows, so if your df has millions of rows that's not great...
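If memory is the concern, one way to keep the broadcasting idea without ever materialising the full n x n comparison matrix is to process col1 in chunks. This is only a sketch of that idea (the count_less_chunked helper and the chunk_size of 10,000 are my own assumptions, not part of the answer above):
import numpy as np

def count_less_chunked(df, chunk_size=10_000):
    # For each value in col1, count how many values in col2 are strictly smaller,
    # building only a chunk_size x n comparison matrix at a time.
    col1 = df['col1'].values
    col2 = df['col2'].values.reshape(1, -1)
    out = np.empty(len(col1), dtype=np.int64)
    for start in range(0, len(col1), chunk_size):
        chunk = col1[start:start + chunk_size].reshape(-1, 1)
        # broadcasted comparison of shape (len(chunk), n), summed along axis=1
        out[start:start + chunk_size] = (col2 < chunk).sum(axis=1)
    return out

df['res'] = count_less_chunked(df)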
[EDIT] There is a solution in O(n*log(n)) (n being the number of rows) in both time and space, which is probably close to optimal (the above is O(n^2) in both; implementing it in C would be O(n^2) in time but only O(n) in space). At first I hadn't written the code, as it seemed tiresome, especially handling the equality cases.
[EDIT2]: Implementing it was actually much easier than I thought; it's just:
# rank of each col1 value within col1
idx1 = np.argsort(np.argsort(df['col1'], kind='mergesort'), kind='mergesort')
# rank of each col1 value within the concatenation of col1 and col2
idx2 = np.argsort(np.argsort(np.concatenate((df['col1'], df['col2'])), kind='mergesort'), kind='mergesort')[:len(idx1)]
# the difference is the number of col2 values strictly less than each col1 value
df['res'] = idx2 - idx1
As said, this is just O(n*log(n)) in both time and space, so even with a large df it takes very little time (0.1s for 100k rows, 1.5s for 1M rows) and very little additional space.
The double argsort is due to numpy's sorting convention: np.argsort doesn't give the position of each element in the sorted vector, but rather the indices idx such that x[idx] is sorted. The small trick of applying argsort twice gives the position of each original element in the sorted vector. I added kind='mergesort' to use a stable sort; this is useless by itself, but it fixes problems when a value appears in both col1 and col2 (that is because we want to count where col2 is < col1; if we wanted <=, then col2 should go before col1 in the concatenation).
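A minimal illustration of the double-argsort trick (the array x here is just an assumed example, not from the question):
import numpy as np

x = np.array([12, 1, 54, 11])
np.argsort(x)              # array([1, 3, 0, 2]) -> x[[1, 3, 0, 2]] is sorted
np.argsort(np.argsort(x))  # array([2, 0, 3, 1]) -> rank of each original element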
Upvotes: 5