Reputation: 3001
I have a dataframe with boolean columns each indicating whether a record belongs to a category:
import pandas as pd
example = pd.DataFrame({
"is_a": [True, False, True, True],
"is_b": [False, False, False, True],
"is_c": [True, False, False, True],
})
example:
is_a is_b is_c
0 True False True
1 False False False
2 True False False
3 True True True
I want to count the number of co-occurrences between each pair of categories. I'm currently doing this:
cols = ["is_a", "is_b", "is_c"]
output = pd.DataFrame(
{x: [(example[x] & example[y]).sum() for y in cols] for x in cols},
index=cols,
)
output:
is_a is_b is_c
is_a 3 1 2
is_b 1 1 1
is_c 2 1 2
This gives me the right output, but I'm wondering whether there is a better way to do it.
Upvotes: 4
Views: 100
Reputation: 323226
I would use NumPy broadcasting:
import numpy as np
s = example.values.T
np.sum(s & s[:, None], 2)
array([[3, 1, 2],
[1, 1, 1],
[2, 1, 2]])
Convert the result to a DataFrame:
pd.DataFrame(np.sum(s&s[:,None],2),columns=example.columns,index=example.columns)
is_a is_b is_c
is_a 3 1 2
is_b 1 1 1
is_c 2 1 2
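To make the broadcasting step easier to follow, here is a minimal sketch that names the intermediate arrays and prints their shapes. The variable names `pairwise` and `counts` are mine, not part of the answer above, and `to_numpy()` is used in place of `.values`; the logic is otherwise the same.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})

# One row per category, one column per record: shape (3, 4)
s = example.to_numpy().T

# s[:, None] has shape (3, 1, 4); broadcasting against s (3, 4)
# yields every pair of categories compared record by record: shape (3, 3, 4)
pairwise = s & s[:, None]

# Summing over the record axis counts records where both flags are True
counts = pairwise.sum(axis=2)

print(s.shape, pairwise.shape, counts.shape)
print(pd.DataFrame(counts, index=example.columns, columns=example.columns))
```

Note that the intermediate array holds one boolean per (category, category, record) triple, so memory grows with the square of the number of columns.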
Upvotes: 1
Reputation: 1525
We can use matrix multiplication to solve this.
import numpy as np
import pandas as pd
example = pd.DataFrame({
"is_a": [True, False, True, True],
"is_b": [False, False, False, True],
"is_c": [True, False, False, True],
})
encoded_example = example.astype(int)
output = pd.DataFrame(
np.dot(encoded_example.T, encoded_example),
index=encoded_example.columns,
columns=encoded_example.columns
)
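Why this works: once the flags are 0/1, entry (i, j) of the product is the sum over records of flag_i * flag_j, which is 1 only when both flags are set. As a quick sanity check (a sketch, with `product` and `reference` as my own names), the result can be compared against the explicit loop from the question:

```python
import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})

# Matrix-product version: 0/1 flags, transpose times original
encoded = example.astype(int)
product = encoded.T.dot(encoded)

# Reference: the explicit pairwise loop from the question
cols = list(example.columns)
reference = pd.DataFrame(
    {x: [(example[x] & example[y]).sum() for y in cols] for x in cols},
    index=cols,
)

# Both approaches should agree on every cell
print((product.to_numpy() == reference.to_numpy()).all())
```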
Upvotes: 1
Reputation: 294218
dot
This uses the pandas method pandas.DataFrame.dot, invoked through the @ operator.
(lambda d: d.T @ d)(example.astype(int))
is_a is_b is_c
is_a 3 1 2
is_b 1 1 1
is_c 2 1 2
The same thing, but using an ndarray instead:
a = example.to_numpy().astype(int)
pd.DataFrame(a.T @ a, example.columns, example.columns)
is_a is_b is_c
is_a 3 1 2
is_b 1 1 1
is_c 2 1 2
Upvotes: 6