Rahul_Patel

Reputation: 13

Creating a co-occurrence matrix

                     0               1                2             3
    0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
    1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]

I am trying to create a co-occurrence matrix from a data set like the one above, which has 800 records and 12 categorical variables. The matrix should count how often every category of each variable co-occurs with every category of the other variables.
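Here is a minimal way to reproduce the example data, assuming the bins are stored as plain strings (the real data set has 800 rows and 12 such columns):

import pandas as pd

df = pd.DataFrame([
    ["(-1.774, 1.145]", "(-3.21, 0.533]", "(0.0166, 2.007]", "(2.0, 3.997]"],
    ["(-1.774, 1.145]", "(-3.21, 0.533]", "(2.007, 3.993]",  "(2.0, 3.997]"],
])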

Upvotes: 0

Views: 379

Answers (2)

Akshay Sehgal

Reputation: 19307

You can do this in a straightforward way using OneHotEncoder() and np.dot():

  1. Turn each element in the dataframe into a string
  2. Use a one-hot encoder to convert the dataframe into one-hot vectors over a unique vocabulary of the categorical elements
  3. Take a dot product with itself to get the co-occurrence counts
  4. Recreate a dataframe using the co-occurrence matrix and the feature names from the one-hot encoder
#assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str)  # turn each element into a string

# get a one-hot representation of the dataframe
l = OneHotEncoder()
data = l.fit_transform(df.values)

# get the co-occurrence matrix using a dot product
co_occurrence = np.dot(data.T, data)

# get the vocab (columns and index) for the co-occurrence matrix
# get_feature_names() adds an "x0_"-style prefix, which is stripped here for readability
vocab = [i[3:] for i in l.get_feature_names()]

# create the co-occurrence dataframe
ddf = pd.DataFrame(co_occurrence.todense(), columns=vocab, index=vocab)
print(ddf)
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0   
(-3.21, 0.533]               2.0             2.0              1.0   
(0.0166, 2.007]              1.0             1.0              1.0   
(2.007, 3.993]               1.0             1.0              0.0   
(2.0, 3.997]                 2.0             2.0              1.0   

                 (2.007, 3.993]  (2.0, 3.997]  
(-1.774, 1.145]             1.0           2.0  
(-3.21, 0.533]              1.0           2.0  
(0.0166, 2.007]             0.0           1.0  
(2.007, 3.993]              1.0           1.0  
(2.0, 3.997]                1.0           2.0  

As you can verify from the output above, it is exactly what the co-occurrence matrix should be.

Advantages of this approach are that you can scale it to new data using the transform method of the one-hot encoder object, and that most of the processing happens on sparse matrices until the final step of creating the dataframe, so it is memory-efficient.
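For example, a minimal sketch of that scaling step, assuming new_df is a hypothetical dataframe with the same columns and only categories already seen during fit (otherwise create the encoder with handle_unknown='ignore'):

# reuse the already-fitted encoder on new rows; everything stays sparse
new_data = l.transform(new_df.astype(str).values)
new_co_occurrence = new_data.T @ new_data  # sparse matrix product
new_ddf = pd.DataFrame(new_co_occurrence.todense(), columns=vocab, index=vocab)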

Upvotes: 1

Surya Narayanan

Reputation: 455

Suppose your data is in a data frame df.

Then, you can do it with two nested loops over the rows of the data frame and two loops over the features of each pair of rows, as follows:

from collections import defaultdict

# for every pair of distinct rows, count each pairing of a feature in the
# earlier row with a feature from a later column of the later row
# (assumes df has a default integer index)
co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:
                co_occurrence[feature, feature2] += 1
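
To inspect the counts as a table, one hypothetical follow-up (not part of the loop above) is to unpack the dictionary into a data frame:

import pandas as pd

# turn the {(feature, feature2): count} dictionary into a readable table
pairs = pd.DataFrame(
    [(a, b, n) for (a, b), n in co_occurrence.items()],
    columns=["feature", "feature2", "count"],
)
print(pairs.sort_values("count", ascending=False))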

Upvotes: 0
