Reputation: 11
I have a DataFrame and, for each row, I want to count how many other rows match a given condition (e.g. the number of rows whose value in column C is less than this row's value). Iterating over each row is too slow (I have ~1B rows), especially when the column's dtype is datetime, but this is how it could be done on a DataFrame df with a column labeled C:
df['newcol'] = 0
for row in df.itertuples():
    df.loc[row.Index, 'newcol'] = len(df[df.C < row.C])
Is there a way to vectorize this?
Thanks!
Upvotes: 0
Views: 4188
Reputation: 819
Preparation:
import numpy as np
import pandas as pd
count = 5000
np.random.seed(100)
data = np.random.randint(100, size=count)
df = pd.DataFrame({'Col': list('ABCDE') * (count // 5),  # integer division: a list cannot be repeated a float number of times
                   'Val': data})
Suggestion:
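# Sorted distinct values and how many times each occurs
u, c = np.unique(data, return_counts=True)
# Running total: values[i] is the number of elements <= u[i]
values = np.cumsum(c)
# Elements strictly less than u[i] are exactly the elements <= u[i-1]
dictionary = dict(zip(u[1:], values[:-1]))
# Nothing is smaller than the minimum
dictionary[u[0]] = 0
df['newcol'] = [dictionary[x] for x in data]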
This gives exactly the same result as your example. If it does not help, please write a more detailed question.
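The same counts can also be obtained without a dictionary, by sorting once and looking up each value's insertion point. This is only a sketch: it assumes "less than" means strictly less and that the column has no missing values, and `count_less` is a name made up here for illustration.

```python
import numpy as np
import pandas as pd

def count_less(values):
    """For each element, count how many elements are strictly smaller."""
    arr = np.asarray(values)
    sorted_arr = np.sort(arr)
    # side='left' returns the index of the first element >= x,
    # which equals the number of elements strictly less than x.
    return np.searchsorted(sorted_arr, arr, side='left')

np.random.seed(100)
data = np.random.randint(100, size=5000)
df = pd.DataFrame({'Val': data})
df['newcol'] = count_less(df['Val'].to_numpy())

# The same function applied to datetime64 values
dates = pd.to_datetime(['2020-01-03', '2020-01-01', '2020-01-02', '2020-01-01'])
date_counts = count_less(dates.to_numpy())
```

Sorting costs O(n log n) and each lookup is a binary search, so this avoids the row-by-row comparison against the whole column.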
Recommendations:
Vectorization and JIT compilation of pandas code are available with numba.
If you work with 1-D arrays, use numpy; in many situations it is faster. Compare:
Pandas
%timeit df['newcol2'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
1 loop, best of 3: 51.1 s per loop 204.34800005
Numpy
%timeit df['newcol3'] = [np.sum(data<x) for x in data]
10 loops, best of 3: 61.3 ms per loop 2.5490000248
Use numpy.sum instead of sum!
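To see why this matters, here is a small sketch that applies both functions to the same boolean array. The threshold 50 and the repetition count are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

np.random.seed(100)
data = np.random.randint(100, size=5000)
mask = data < 50  # boolean NumPy array

builtin_total = sum(mask)    # built-in sum iterates element by element in Python
numpy_total = np.sum(mask)   # np.sum reduces inside NumPy's compiled code

# Time each call a few times; absolute numbers depend on the machine
t_builtin = timeit.timeit(lambda: sum(mask), number=20)
t_numpy = timeit.timeit(lambda: np.sum(mask), number=20)
print(f"builtin sum: {t_builtin:.4f} s, np.sum: {t_numpy:.4f} s")
```

Both calls return the same total; only the time taken differs.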
Upvotes: 2
Reputation: 107567
Consider pandas.DataFrame.apply with a lambda expression to count the rows that meet your condition. Admittedly, apply is a loop, and running it across ~1 billion rows may take a long time.
import numpy as np
import pandas as pd
np.random.seed(161)
df = pd.DataFrame({'Col': list('ABCDE') * 3,
                   'Val': np.random.randint(100, size=15)})
df['newcol'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)
#    Col  Val  newcol
# 0    A   78      13
# 1    B   11       2
# 2    C   51       8
# 3    D   31       5
# 4    E   29       4
# 5    A   99      14
# 6    B   65      10
# 7    C   16       3
# 8    D   43       7
# 9    E   10       1
# 10   A   67      11
# 11   B   36       6
# 12   C    1       0
# 13   D   73      12
# 14   E   64       9
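If the condition is a plain "less than" on one column, the loop inside apply can probably be avoided with Series.rank. The sketch below reuses the Val values from the output above and assumes the column has no missing values (rank returns NaN for those, and the integer cast would fail).

```python
import pandas as pd

df = pd.DataFrame({'Col': list('ABCDE') * 3,
                   'Val': [78, 11, 51, 31, 29, 99, 65, 16, 43, 10, 67, 36, 1, 73, 64]})

# method='min' gives every value the lowest 1-based rank in its tie group,
# so subtracting 1 leaves the number of strictly smaller values.
df['newcol'] = df['Val'].rank(method='min').astype(int) - 1

# Ties: both 5s have exactly one smaller value
tied = (pd.Series([5, 5, 1]).rank(method='min').astype(int) - 1).tolist()
```

Conditions other than a simple ordering comparison would need a different approach.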
Upvotes: 0