Reputation: 5354
I have a very large file (5GB), and I need to count the number of occurence using two columns
a b c d e
0 2 3 1 5 4
1 2 3 2 5 4
2 1 3 2 5 4
3 2 4 1 5 3
4 2 4 1 5 3
so obviously I have to find
(2,3):2
(1,3):1
(2,4):2
How can I do that in a very fast way.
I used:
df.groupby(['a','b']).count().to_dict()
Let's say that the final result would be
a b freq
2 3 2
1 3 1
2 4 2
Upvotes: 3
Views: 2446
Reputation: 85432
If you have high frequencies, i.e. few combinations of a
and b
, the final dictionary will be small. If you have many of different combinations, you will need lots of RAM.
If you have low frequencies and enough RAM, looks like your approach is good.
Some timings for 5e6 rows and numbers from 0 to 19:
>>> df = pd.DataFrame(np.random.randint(0, 19, size=(5000000, 5)), columns=list('abcde'))
>>> df.shape
(5000000, 5)
%timeit df.groupby(['a','b']).count().to_dict()
1 loops, best of 3: 552 ms per loop
%timeit df.groupby(['a','b']).size()
1 loops, best of 3: 619 ms per loop
%timeit df.groupby(['a','b']).count()
1 loops, best of 3: 588 ms per loop
Using a different range of integers, here up to sys.maxsize
(9223372036854775807), changes the timings considerably:
import sys
df = pd.DataFrame(np.random.randint(0, high=sys.maxsize, size=(5000000, 5)),
columns=list('abcde'))
%timeit df.groupby(['a','b']).count().to_dict()
1 loops, best of 3: 41.3 s per loop
%timeit df.groupby(['a','b']).size()
1 loops, best of 3: 11.4 s per loop
%timeit df.groupby(['a','b']).count()
1 loops, best of 3: 12.9 s per loop`
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
print(g)
a b feq
0 1 3 1
1 2 3 2
2 2 4 2
It is not much faster though.
For range 0 to 19:
%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
1 loops, best of 3: 564 ms per loop
For range 0 to sys.maxsize
:
%%timeit
df2 = df.drop(list('cd'), axis=1)
df2.rename(columns={'e': 'feq'}, inplace=True)
g = df2.groupby(['a','b']).count()
g.reset_index(inplace=True)
1 loops, best of 3: 10.2 s per loop
Upvotes: 2