Reputation: 13
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
I am trying to create a co-occurrence matrix for a data set like the one above, with 800 records and 12 categorical variables. The goal is to count, for every category of each variable, how often it occurs together with every category of every other variable.
Upvotes: 0
Views: 379
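For reference, the two example rows shown above can be reconstructed as a small pandas DataFrame; the column names 0-3 and the interval strings are taken directly from the table, and this is only a sketch of the input format, not the actual 800-record data set.

import pandas as pd

#hypothetical reconstruction of the two example rows; the real data has
#800 rows and 12 categorical columns of the same kind
df = pd.DataFrame(
    [["(-1.774, 1.145]", "(-3.21, 0.533]", "(0.0166, 2.007]", "(2.0, 3.997]"],
     ["(-1.774, 1.145]", "(-3.21, 0.533]", "(2.007, 3.993]", "(2.0, 3.997]"]],
    columns=[0, 1, 2, 3])
print(df)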
Reputation: 19307
You can do this in a straightforward way using OneHotEncoder(), np.dot(), and the feature names exposed by the one-hot encoder.

#assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str)  #turn each element into a string

#get a one-hot representation of the dataframe
enc = OneHotEncoder()
data = enc.fit_transform(df.values)

#get the co-occurrence matrix using a dot product
co_occurrence = np.dot(data.T, data)

#get vocab (columns and index) for the co-occurrence matrix
#get_feature_names() adds a prefix such as "x0_", which I strip here for readability
#(on scikit-learn >= 1.0 the method is called get_feature_names_out())
vocab = [i[3:] for i in enc.get_feature_names()]

#create the co-occurrence matrix as a dataframe
ddf = pd.DataFrame(co_occurrence.todense(), columns=vocab, index=vocab)
print(ddf)
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0
(-3.21, 0.533]               2.0             2.0              1.0
(0.0166, 2.007]              1.0             1.0              1.0
(2.007, 3.993]               1.0             1.0              0.0
(2.0, 3.997]                 2.0             2.0              1.0

                 (2.007, 3.993]  (2.0, 3.997]
(-1.774, 1.145]             1.0           2.0
(-3.21, 0.533]              1.0           2.0
(0.0166, 2.007]             0.0           1.0
(2.007, 3.993]              1.0           1.0
(2.0, 3.997]                1.0           2.0
As you can verify from the output above, this is exactly what the co-occurrence matrix should be.
An advantage of this approach is that you can scale it using the transform
method of the fitted one-hot encoder, and most of the processing happens on sparse matrices until the final step of creating the dataframe, so it is memory efficient.
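As a rough sketch of that scaling idea (the chunk size, the handle_unknown="ignore" setting, and get_feature_names_out() are assumptions on my part, not part of the code above), the counts can be accumulated batch by batch without ever densifying the full one-hot matrix:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

#fit the encoder once; ignore categories that do not appear at fit time
enc = OneHotEncoder(handle_unknown="ignore")
values = df.astype(str).values
enc.fit(values)

#stream the rows through transform() in chunks and sum the sparse products
co_occurrence = None
for start in range(0, len(values), 200):  #200-row chunks, chosen arbitrarily
    batch = enc.transform(values[start:start + 200])
    block = batch.T @ batch  #sparse (vocab x vocab) count block
    co_occurrence = block if co_occurrence is None else co_occurrence + block

#same vocabulary handling as above; get_feature_names_out() needs scikit-learn >= 1.0
vocab = [name.split("_", 1)[1] for name in enc.get_feature_names_out()]
ddf = pd.DataFrame(co_occurrence.todense(), columns=vocab, index=vocab)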
Upvotes: 1
Reputation: 455
Suppose your data is in a data frame df.
Then you can build the co-occurrence counts with two loops over the data frame and two more loops over the features of each row, as follows:
from collections import defaultdict

#for every pair of rows (earlier row first), count each feature of the first row
#together with the later-column features of the second row;
#assumes df has a default 0..n-1 integer index
co_occurrence = defaultdict(int)
for index, row in df.iterrows():
    for index2, row2 in df.iloc[index + 1:].iterrows():
        for row_index, feature in enumerate(row):
            for feature2 in row2[row_index + 1:]:
                co_occurrence[feature, feature2] += 1
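If you prefer the counts as a square table rather than a dictionary, the pair counts can be poured into a pandas DataFrame; the snippet below is only a sketch and assumes the co_occurrence dictionary produced by the loop above, with string category labels.

import pandas as pd

#collect every category label that appears in the pair counts
labels = sorted({label for pair in co_occurrence for label in pair})

#start from an all-zero square matrix and fill in the counted pairs
matrix = pd.DataFrame(0, index=labels, columns=labels)
for (a, b), count in co_occurrence.items():
    matrix.loc[a, b] = count
print(matrix)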
Upvotes: 0