Rahul_Patel

Reputation: 13

Creating a co-occurrence matrix

                     0               1                2             3
    0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
    1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]

I am trying to create a co-occurrence matrix from a data set like the one above, which has 800 records and 12 categorical variables. The matrix should count how often every category of each variable co-occurs with every category of the other variables.
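Here is a minimal way to reproduce the example data, assuming the bins are stored as plain strings (the real data set has 800 rows and 12 such columns):

import pandas as pd

df = pd.DataFrame([
    ["(-1.774, 1.145]", "(-3.21, 0.533]", "(0.0166, 2.007]", "(2.0, 3.997]"],
    ["(-1.774, 1.145]", "(-3.21, 0.533]", "(2.007, 3.993]",  "(2.0, 3.997]"],
])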

Upvotes: 0

Views: 379

Answers (2)

Akshay Sehgal

Reputation: 19307

You can do this in a straightforward way using OneHotEncoder() and np.dot():

  1. Turn each element in the dataframe into a string
  2. Use a one-hot encoder to convert the dataframe into one-hot vectors over a unique vocabulary of the categorical elements
  3. Take a dot product with itself to get the co-occurrence counts
  4. Recreate a dataframe using the co-occurrence matrix and the feature names from the one-hot encoder
#assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str)  # turn each element into a string

# get a one-hot representation of the dataframe
l = OneHotEncoder()
data = l.fit_transform(df.values)

# get the co-occurrence matrix using a dot product
co_occurrence = np.dot(data.T, data)

# get the vocab (columns and index) for the co-occurrence matrix
# get_feature_names() adds an "x0_"-style prefix, which is stripped here for readability
vocab = [i[3:] for i in l.get_feature_names()]

# create the co-occurrence dataframe
ddf = pd.DataFrame(co_occurrence.todense(), columns=vocab, index=vocab)
print(ddf)
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0   
(-3.21, 0.533]               2.0             2.0              1.0   
(0.0166, 2.007]              1.0             1.0              1.0   
(2.007, 3.993]               1.0             1.0              0.0   
(2.0, 3.997]                 2.0             2.0              1.0   

                 (2.007, 3.993]  (2.0, 3.997]  
(-1.774, 1.145]             1.0           2.0  
(-3.21, 0.533]              1.0           2.0  
(0.0166, 2.007]             0.0           1.0  
(2.007, 3.993]              1.0           1.0  
(2.0, 3.997]                1.0           2.0  

As you can verify from the output above, it is exactly what the co-occurrence matrix should be.

Advantages of this approach are that you can scale it to new data using the transform method of the one-hot encoder object, and that most of the processing happens on sparse matrices until the final step of creating the dataframe, so it is memory-efficient.
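For example, a minimal sketch of that scaling step, assuming new_df is a hypothetical dataframe with the same columns and only categories already seen during fit (otherwise create the encoder with handle_unknown='ignore'):

# reuse the already-fitted encoder on new rows; everything stays sparse
new_data = l.transform(new_df.astype(str).values)
new_co_occurrence = new_data.T @ new_data  # sparse matrix product
new_ddf = pd.DataFrame(new_co_occurrence.todense(), columns=vocab, index=vocab)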

Upvotes: 1

Surya Narayanan

Reputation: 455

Suppose your data is in a data frame df.

Then, you can do it with two nested loops over the rows of the data frame and two loops over the features of each pair of rows, as follows:

from collections import defaultdict

# for every pair of distinct rows, count each pairing of a feature in the
# earlier row with a feature from a later column of the later row
# (assumes df has a default integer index)
co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:
                co_occurrence[feature, feature2] += 1
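
To inspect the counts as a table, one hypothetical follow-up (not part of the loop above) is to unpack the dictionary into a data frame:

import pandas as pd

# turn the {(feature, feature2): count} dictionary into a readable table
pairs = pd.DataFrame(
    [(a, b, n) for (a, b), n in co_occurrence.items()],
    columns=["feature", "feature2", "count"],
)
print(pairs.sort_values("count", ascending=False))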

Upvotes: 0
