krits

Reputation: 68

Co-occurrence matrix in Python

I have a huge data set (sample below) and need to compute the co-occurrence matrix of the skills column. I read about co-occurrence matrices, and CountVectorizer from scikit-learn shed some light, so I wrote the code below, but I am confused about how to interpret the results. If anyone can help, please do. The sample data and my attempt are below.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df1 = pd.DataFrame(
    [["1000074", "6284 6295"],
     ["75634786", "4044 4714 5789 6076 6077 6079 6082 6168 6229"],
     ["75635714", "4092 4420 4430 4437 4651"]],
    columns=['people_id', 'skills_id'])

count_vect = CountVectorizer(ngram_range=(1, 1), lowercase=False)
X_counts = count_vect.fit_transform(df1['skills_id'])
Xc = X_counts.T * X_counts
Xc.setdiag(0)
print(Xc.todense())

I am pretty new to this idea of a co-occurrence matrix over numbers. Word-to-word co-occurrence I can understand, but how do I read and understand the result here?

Upvotes: 0

Views: 776

Answers (1)

arnaud

Reputation: 3483

Well, you can think of it just like a word-to-word co-occurrence matrix. Here, assuming your skill column is that second column of space-separated 4-digit numbers, it first collects all unique possible values:

>>> count_vect.get_feature_names()

['4044',
 '4092',
 '4420',
 '4430',
 '4437',
 '4651',
 '4714',
 '5789',
 '6076',
 '6077',
 '6079',
 '6082',
 '6168',
 '6229',
 '6284',
 '6295']

That's an array of size 16, representing the 16 different words found in your skill column. Indeed, sklearn.feature_extraction.text.CountVectorizer() finds the words by splitting your strings on whitespace. (Note that in recent scikit-learn releases this method is called get_feature_names_out().)

The final matrix you see with print(Xc.todense()) is just the co-occurrence matrix for these 16 words, which is why it is of size (16, 16): entry (i, j) is the number of rows in which token i and token j appear together.

To make it clearer (please forgive the column alignment), you could look at:

>>> pd.DataFrame(Xc.todense(),
...     columns=count_vect.get_feature_names(),
...     index=count_vect.get_feature_names())

      4044 4092 4420 ...
4044    0   0   0   0   0   0   1   1   1   1   1   1   1   1   0   0
4092    0   0   1   1   1   1   0   0   0   0   0   0   0   0   0   0
4420    0   1   0   1   1   1   0   0   0   0   0   0   0   0   0   0
4430    0   1   1   0   1   1   0   0   0   0   0   0   0   0   0   0
4437    0   1   1   1   0   1   0   0   0   0   0   0   0   0   0   0
4651    0   1   1   1   1   0   0   0   0   0   0   0   0   0   0   0
4714    1   0   0   0   0   0   0   1   1   1   1   1   1   1   0   0
5789    1   0   0   0   0   0   1   0   1   1   1   1   1   1   0   0
6076    1   0   0   0   0   0   1   1   0   1   1   1   1   1   0   0
6077    1   0   0   0   0   0   1   1   1   0   1   1   1   1   0   0
6079    1   0   0   0   0   0   1   1   1   1   0   1   1   1   0   0
6082    1   0   0   0   0   0   1   1   1   1   1   0   1   1   0   0
6168    1   0   0   0   0   0   1   1   1   1   1   1   0   1   0   0
6229    1   0   0   0   0   0   1   1   1   1   1   1   1   0   0   0
6284    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
6295    0   0   0   0   0   0   0   0   0   0   0   0   0   0   1   0

tl;dr In this case, since you input strings, whether the tokens are numbers (e.g. "23") or nouns (e.g. "cat") doesn't change anything. Each entry of the co-occurrence matrix is the number of rows in which the two tokens appear together (all 0s and 1s here, because every pair co-occurs in at most one row of your sample). The default tokenizer for CountVectorizer() essentially splits on whitespace.

What exactly would you have expected to be different with numbers?

Upvotes: 1
