Reputation: 1099
I have two large columns of data (about 1.5 million values each). They are structured as:
col1 = [2,2,1,4,5,4,3,4,4,4,5,2,3,1,1 ..] etc.
col2 = [1,1,8,8,3,5,6,7,2,3,10.........] etc.
I want to build a joint count matrix CountsMAT. col1 has values ranging from 1 to 5 and col2 ranges from 1 to 10.
counts of [(1,2),(1,3),...(1,10),
(2,1), (2,2),....(2,10),
(3,1),......,(3,3)...(3,10),
...........................
(5,1),(5,2),...(5,5).....(5,10)] are required, i.e.,
a CountsMAT of size (5, 10), since
max(col1) = 5; max(col2) = 10
I've implemented a version with a defaultdict and a for loop, but it takes a while. I'm sure this could be handled more adeptly with pandas directly, and I'd appreciate an optimal implementation using numpy/pandas. Many similar questions point to pandas' groupby, but I'm not well-versed in it.
Getting the output in matrix format is also tricky with my dict-based implementation, but I expect it will be easier with pandas/numpy. Thanks!
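For reference, a minimal sketch of the defaultdict-plus-loop approach described above (the col2 sample values past those shown in the question are made up here so the two lists have equal length):

```python
# Count each (col1, col2) pair with a defaultdict, then fill a 5x10 matrix.
from collections import defaultdict

col1 = [2, 2, 1, 4, 5, 4, 3, 4, 4, 4, 5, 2, 3, 1, 1]
col2 = [1, 1, 8, 8, 3, 5, 6, 7, 2, 3, 10, 4, 9, 5, 6]  # last 4 values invented

counts = defaultdict(int)
for a, b in zip(col1, col2):
    counts[(a, b)] += 1

# Convert the dict to a 5x10 nested-list matrix (rows: col1, cols: col2).
CountsMAT = [[counts[(i, j)] for j in range(1, 11)] for i in range(1, 6)]
```

This is the slow pure-Python baseline; the answers below replace the explicit loop with vectorized counting.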
Upvotes: 2
Views: 797
Reputation: 10769
The numpy_indexed package (disclaimer: I am its author) has a function to solve this problem efficiently:
import numpy_indexed as npi
print(npi.count_table(col1, col2))
Upvotes: 1
Reputation: 18541
This might work, using numpy.histogram2d:
import numpy as np
col1 = np.random.randint(1, 6, 100)   # random_integers is deprecated; randint's upper bound is exclusive
col2 = np.random.randint(1, 11, 100)
bin1 = np.arange(0.5, 6.5, 1)
bin2 = np.arange(0.5, 11.5, 1)
mat = np.histogram2d(col1, col2, bins=[bin1, bin2])[0]
print(mat.shape)
print(mat)
which yields
(5, 10)
array([[ 4., 2., 0., 6., 2., 2., 1., 2., 1., 2.],
[ 3., 3., 3., 0., 3., 0., 1., 4., 1., 0.],
[ 4., 2., 1., 1., 3., 2., 5., 2., 2., 2.],
[ 1., 1., 3., 2., 3., 1., 4., 4., 0., 0.],
[ 0., 2., 1., 4., 3., 1., 3., 2., 0., 1.]])
I haven't tested it using lists with lengths over a couple thousand, but I think it should scale OK for a few million values.
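As an alternative sketch (not part of the answer above): since both columns are small positive integers, each (col1, col2) pair can be flattened to a single index and counted with np.bincount, which typically scales well to millions of values. This assumes col1 is in 1..5 and col2 in 1..10.

```python
import numpy as np

col1 = np.random.randint(1, 6, 100)
col2 = np.random.randint(1, 11, 100)

# Map pair (i, j) with i in 1..5, j in 1..10 to the flat index (i-1)*10 + (j-1).
flat = (col1 - 1) * 10 + (col2 - 1)            # values in 0..49
mat = np.bincount(flat, minlength=50).reshape(5, 10)
```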
Edit:
Hi @nahsivar. I'm not too familiar with pandas (but I probably should be), so I spent a few minutes playing around. Here are several ways to get what you want (I think):
# instantiate dataframe
import pandas as pd
from random import choice
x_choice = 'ABCDE'
y_choice = 'ABCDEFGHIJ'
x, y = [], []
for i in range(100):
    x.append(choice(x_choice))
    y.append(choice(y_choice))
df = pd.DataFrame(data={'col1': x, 'col2': y})
# 1
df.pivot_table(index='col1', columns='col2', aggfunc='size')
# (the old rows=/cols= keywords were removed from pandas)
# Use fill_value=0 to replace the NaNs with 0
# Output:
col2 A B C D E F G H I J
col1
A NaN 1 3 1 2 2 2 1 4 2
B 1 1 NaN 3 5 1 2 3 1 NaN
C 4 1 2 NaN NaN 4 3 2 1 2
D 2 2 2 1 1 3 3 4 4 2
E 1 1 1 NaN 4 2 6 3 2 2
# 2
df.groupby('col2')['col1'].value_counts().unstack(level=0)
# Output:
col2 A B C D E F G H I J
A NaN 1 3 1 2 2 2 1 4 2
B 1 1 NaN 3 5 1 2 3 1 NaN
C 4 1 2 NaN NaN 4 3 2 1 2
D 2 2 2 1 1 3 3 4 4 2
E 1 1 1 NaN 4 2 6 3 2 2
# 3
pd.crosstab(df.col1, df.col2)
# Output:
col2 A B C D E F G H I J
col1
A 0 1 3 1 2 2 2 1 4 2
B 1 1 0 3 5 1 2 3 1 0
C 4 1 2 0 0 4 3 2 1 2
D 2 2 2 1 1 3 3 4 4 2
E 1 1 1 0 4 2 6 3 2 2
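Applied to the integer-valued columns from the question, pd.crosstab works the same way; reindexing (an addition here, not in the output above) guards against any value never occurring, so the result is always (5, 10) under the assumption that the ranges are 1..5 and 1..10:

```python
import numpy as np
import pandas as pd

col1 = np.random.randint(1, 6, 1000)
col2 = np.random.randint(1, 11, 1000)

# Cross-tabulate, then force the full 5x10 grid even if a value is absent.
CountsMAT = (pd.crosstab(col1, col2)
               .reindex(index=range(1, 6), columns=range(1, 11), fill_value=0))
```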
Upvotes: 5