ricardoaraujo

Reputation: 11

Pandas: for each row in a DataFrame, count the number of rows matching a condition

I have a DataFrame and want to calculate, for each row, how many other rows match a given condition (e.g. the number of rows whose value in column C is less than this row's value). Iterating through each row is too slow (I have ~1B rows), especially when the column's dtype is datetime, but this is how it could be run on a DataFrame df with a column labeled C:

df['newcol'] = 0
for row in df.itertuples():
    df.loc[row.Index, 'newcol'] = len(df[df.C < row.C])

Is there a way to vectorize this?

Thanks!

Upvotes: 0

Views: 4188

Answers (2)

Konstantin Purtov

Reputation: 819

Preparation:

import numpy as np
import pandas as pd
count = 5000

np.random.seed(100)
data = np.random.randint(100, size=count)

# Integer division: list repetition needs an int (count / 5 is a float in Python 3)
df = pd.DataFrame({'Col': list('ABCDE') * (count // 5),
                   'Val': data})

Suggestion:

u, c = np.unique(data, return_counts=True)
values = np.cumsum(c)
dictionary = dict(zip(u[1:], values[:-1]))
dictionary[u[0]] = 0
df['newcol'] = [dictionary[x] for x in data]

This produces exactly the same result as your example. If it does not help, please write a more detailed question.
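The same counts can probably also be obtained without the dictionary lookup. Here is a minimal sketch, assuming the condition is "strictly less than" as in the question: sort the values once with np.sort, then use np.searchsorted with side='left' to find each value's leftmost insertion point, which equals the number of strictly smaller elements. The column name Val and the sample data are just the ones from the preparation step above.

```python
import numpy as np
import pandas as pd

np.random.seed(100)
data = np.random.randint(100, size=5000)
df = pd.DataFrame({'Val': data})

values = df['Val'].to_numpy()

# Sort once (O(n log n)), then binary-search every value in the sorted copy.
# With side='left', the insertion index of x is the count of elements < x.
sorted_vals = np.sort(values)
df['newcol'] = np.searchsorted(sorted_vals, values, side='left')
```

For "less than or equal" counts, side='right' would be the natural variant, though it also counts the row itself.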

Recommendations:

Pandas vectorization and JIT compilation are available with numba.

If you work with 1-D arrays, use numpy. In many situations it is faster. Just compare:

Pandas

%timeit df['newcol2'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)

1 loop, best of 3: 51.1 s per loop 204.34800005

Numpy

%timeit df['newcol3'] = [np.sum(data<x) for x in data]

10 loops, best of 3: 61.3 ms per loop 2.5490000248

Use numpy.sum instead of sum!
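For small arrays, the remaining Python loop can be dropped as well by letting numpy broadcasting compare every pair at once. This is only a sketch under the assumption that an n x n boolean array fits in memory (about 25 MB for 5000 values); it is not a realistic option for ~1B rows.

```python
import numpy as np

np.random.seed(100)
data = np.random.randint(100, size=5000)

# Broadcast a row vector against a column vector:
# less[i, j] is True when data[j] < data[i].
less = data[None, :] < data[:, None]

# Summing each row counts how many elements are smaller than data[i].
counts = less.sum(axis=1)
```

The memory cost grows quadratically with the number of rows, so this trades memory for the loop rather than changing the amount of work done.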

Upvotes: 2

Parfait

Reputation: 107567

Consider pandas.DataFrame.apply with a lambda expression to count the rows matching your condition. Admittedly, apply is a loop, and running it across ~1 billion rows may take a long time.

import numpy as np
import pandas as pd

np.random.seed(161)

df = pd.DataFrame({'Col': list('ABCDE') * 3,
                   'Val': np.random.randint(100, size=15)})

df['newcol'] = df.apply(lambda x: sum(df['Val'] < x.Val), axis=1)

#    Col  Val  newcol
# 0    A   78      13
# 1    B   11       2
# 2    C   51       8
# 3    D   31       5
# 4    E   29       4
# 5    A   99      14
# 6    B   65      10
# 7    C   16       3
# 8    D   43       7
# 9    E   10       1
# 10   A   67      11
# 11   B   36       6
# 12   C    1       0
# 13   D   73      12
# 14   E   64       9
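If the loop in apply turns out to be too slow, a ranking-based variant may be worth trying. This is a sketch, assuming the condition stays "strictly less than" and the column has no missing values: Series.rank with method='min' gives each value a 1-based rank equal to one plus the number of strictly smaller values, so subtracting 1 yields the count.

```python
import numpy as np
import pandas as pd

np.random.seed(161)

df = pd.DataFrame({'Col': list('ABCDE') * 3,
                   'Val': np.random.randint(100, size=15)})

# method='min' assigns tied values the lowest rank of their group,
# i.e. 1 + (number of strictly smaller values); subtract 1 for the count.
# rank() returns floats, so cast back to int (assumes no NaN in 'Val').
df['newcol'] = df['Val'].rank(method='min').astype(int) - 1
```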

Upvotes: 0
