Better way to compute co-occurrences in Pandas

I have a DataFrame with boolean columns, each indicating whether a record belongs to a category:

import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})

example:

    is_a    is_b    is_c
0   True    False   True
1   False   False   False
2   True    False   False
3   True    True    True

I want to count the number of co-occurrences between each pair of categories. I'm currently doing this:

cols = ["is_a", "is_b", "is_c"]
output = pd.DataFrame(
    {x: [(example[x] & example[y]).sum() for y in cols] for x in cols},
    index=cols,
)

output:

     is_a is_b is_c
is_a    3    1    2
is_b    1    1    1
is_c    2    1    2

This gives me the right output, but is there a more idiomatic or more efficient way to do it?

Upvotes: 4

Views: 100

Answers (3)

BENY

Reputation: 323226

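I would use NumPy broadcasting:

import numpy as np

s = example.values.T  # shape (n_categories, n_records)
np.sum(s & s[:, None], axis=2)  # AND every pair of columns, then count the True values
array([[3, 1, 2],
       [1, 1, 1],
       [2, 1, 2]])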

Convert the result to a DataFrame:

pd.DataFrame(np.sum(s & s[:, None], axis=2), columns=example.columns, index=example.columns)
      is_a  is_b  is_c
is_a     3     1     2
is_b     1     1     1
is_c     2     1     2
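
To make the broadcasting step easier to follow, here is a minimal sketch that prints the intermediate shapes. The names `pairs` and `counts` are only illustrative and do not appear in the answer above.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})

s = example.to_numpy().T      # (3 categories, 4 records)
pairs = s[:, None] & s        # (3, 1, 4) & (3, 4) broadcasts to (3, 3, 4)
counts = pairs.sum(axis=2)    # collapse the record axis, leaving (3, 3)

print(s.shape, pairs.shape, counts.shape)
print(counts)
```

Note that the intermediate `pairs` array holds one value per pair of categories per record, so its size grows with the square of the number of categories. For wide frames, a matrix product avoids materializing that array.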

Upvotes: 1

Kyle Parsons

Reputation: 1525

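We can use matrix multiplication to solve this. After casting the booleans to integers, entry (i, j) of the transposed frame multiplied by the frame itself is the number of rows in which columns i and j are both True.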

import numpy as np
import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})

encoded_example = example.astype(int)

output = pd.DataFrame(
    np.dot(encoded_example.T, encoded_example),
    index=encoded_example.columns,
    columns=encoded_example.columns
)
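
The snippet above does not show its output, so here is a small check, written as a sketch, that the matrix product agrees with the pairwise loop from the question. The names `expected`, `encoded`, and `product` are illustrative only.

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})
cols = list(example.columns)

# Pairwise loop from the question, used here as the reference result.
expected = pd.DataFrame(
    {x: [(example[x] & example[y]).sum() for y in cols] for x in cols},
    index=cols,
)

# Matrix-product version: cast to int, then multiply X^T by X.
encoded = example.astype(int).to_numpy()
product = pd.DataFrame(np.dot(encoded.T, encoded), index=cols, columns=cols)

print(product)
print((product.to_numpy() == expected.to_numpy()).all())
```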

Upvotes: 1

piRSquared

Reputation: 294218

dot

This uses the pandas.DataFrame.dot method via the @ operator.

(lambda d: d.T @ d)(example.astype(int))

      is_a  is_b  is_c
is_a     3     1     2
is_b     1     1     1
is_c     2     1     2

The same thing, using a NumPy ndarray instead:

a = example.to_numpy().astype(int)
pd.DataFrame(a.T @ a, index=example.columns, columns=example.columns)

      is_a  is_b  is_c
is_a     3     1     2
is_b     1     1     1
is_c     2     1     2
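
Both variants cast to int before multiplying. A short sketch of why that cast matters, using only NumPy's matrix product on the raw boolean array (the names `as_bool` and `as_int` are illustrative):

```python
import numpy as np
import pandas as pd

example = pd.DataFrame({
    "is_a": [True, False, True, True],
    "is_b": [False, False, False, True],
    "is_c": [True, False, False, True],
})

b = example.to_numpy()        # dtype bool

# Without the cast the product stays boolean: each entry only says
# whether the two columns are ever True in the same row.
as_bool = b.T @ b

# With the cast the product holds the actual co-occurrence counts.
i = b.astype(int)
as_int = i.T @ i

print(as_bool.dtype)
print(as_int)
```

So the cast is what turns a yes/no answer into a count.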

Upvotes: 6
