user3378649
user3378649

Reputation: 5354

How can I count frequency of two items using two columns using Pandas

I have a very large file (5GB), and I need to count the number of occurence using two columns

     a   b   c   d   e
0    2   3   1   5   4
1    2   3   2   5   4
2    1   3   2   5   4
3    2   4   1   5   3
4    2   4   1   5   3

so obviously I have to find

(2,3):2
(1,3):1
(2,4):2

How can I do that in a very fast way.

I used:

df.groupby(['a','b']).count().to_dict() 

Let's say that the final result would be

a b freq
2 3 2
1 3 1
2 4 2

Upvotes: 3

Views: 2446

Answers (1)

Mike Müller
Mike Müller

Reputation: 85432

Approach for the first version of the question - dictionary as result

If you have high frequencies, i.e. few combinations of a and b, the final dictionary will be small. If you have many of different combinations, you will need lots of RAM.

If you have low frequencies and enough RAM, looks like your approach is good.

Some timings for 5e6 rows and numbers from 0 to 19:

>>> df = pd.DataFrame(np.random.randint(0, 19, size=(5000000, 5)), columns=list('abcde'))
>>> df.shape
(5000000, 5)

%timeit df.groupby(['a','b']).count().to_dict() 
1 loops, best of 3: 552 ms per loop

%timeit  df.groupby(['a','b']).size()
1 loops, best of 3: 619 ms per loop

%timeit  df.groupby(['a','b']).count()
1 loops, best of 3: 588 ms per loop

Using a different range of integers, here up to sys.maxsize (9223372036854775807), changes the timings considerably:

import sys
df = pd.DataFrame(np.random.randint(0, high=sys.maxsize, size=(5000000, 5)), 
                  columns=list('abcde'))


%timeit df.groupby(['a','b']).count().to_dict() 
1 loops, best of 3: 41.3 s per loop

%timeit  df.groupby(['a','b']).size()
1 loops, best of 3: 11.4 s per loop

%timeit  df.groupby(['a','b']).count()
1 loops, best of 3: 12.9 s per loop`

Solution for the updated question

df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 
print(g)

   a  b  feq
0  1  3    1
1  2  3    2
2  2  4    2

It is not much faster though.

For range 0 to 19:

%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 

1 loops, best of 3: 564 ms per loop

For range 0 to sys.maxsize:

%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True) 
1 loops, best of 3: 10.2 s per loop

Upvotes: 2

Related Questions