Jialun Liu

Reputation: 331

Implement K Neighbors Classifier and Linear SVM in scikit-learn for word sense disambiguation

I am trying to use a linear SVM and a K Neighbors Classifier to do word sense disambiguation (WSD). Here is a segment of the data I am using for training:

<corpus lang="English">

<lexelt item="activate.v">


<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
</context>
</instance>


<instance id="activate.v.bnc.00044852" docsrc="BNC">
<answer instance="activate.v.bnc.00044852" senseid="38201"/>
<answer instance="activate.v.bnc.00044852" senseid="38202"/>
<context>
For neurophysiologists and neuropsychologists ,  the way forward in understanding perception has been to correlate these dimensions of experience with ,  firstly ,  the material properties of the experienced object or event  ( usually regarded as the stimulus )  and ,  secondly ,  the patterns of discharges in the sensory system .  Qualitative Aspects of Experience The quality or modality of the experience depends less upon the quality of energy reaching the nervous system than upon which parts of the sensory system are <head>activated</head> : stimulation of the retinal receptors causes an experience of light ; stimulation of the receptors in the inner ear gives rise to the experience of sound ; and so on . Muller 's  nineteenth - century  doctrine of specific energies  formalized the ordinary observation that different sense organs are sensitive to different physical properties of the world and that when they are stimulated ,  sensations specific to those organs are experienced .  It was proposed that there are endings  ( or receptors )  within the nervous system which are attuned to specific types of energy ,  For example ,  retinal receptors in the eye respond to light energy ,  cochlear endings in the ear to vibrations in the air ,  and so on .  
</context>
</instance>
.....

The difference between training and test data is that the test data don't have the "answer" tag. I have built a dictionary that stores, for each instance, the words that are neighbors of the "head" word within a window size of 10. When there are multiple "answer" tags for one instance, I only consider the first one. I have also built a set recording the entire vocabulary of the training file, so that I can compute a vector for each instance. For example, if the total vocabulary is [a,b,c,d,e] and one instance has the words [a,a,d,d,e], the resulting vector for that instance would be [2,0,0,2,1]. Here's a segment of the dictionary I built for each word:

{
    "activate.v": {
        "activate.v.bnc.00024693": {
            "instanceId": "activate.v.bnc.00024693", 
            "senseId": "38201", 
            "vocab": {
                "although": 1, 
                "back": 1, 
                "bend": 1, 
                "bicycl": 1, 
                "correct": 1, 
                "dig": 1, 
                "general": 1, 
                "handlebar": 1, 
                "hefti": 1, 
                "lever": 1, 
                "nt": 2, 
                "quit": 1, 
                "rear": 1, 
                "spade": 1, 
                "sprung": 1, 
                "step": 1, 
                "type": 1, 
                "use": 1, 
                "wo": 1
            }
        }, 
        "activate.v.bnc.00044852": {
            "instanceId": "activate.v.bnc.00044852", 
            "senseId": "38201", 
            "vocab": {
                "caus": 1, 
                "ear": 1, 
                "energi": 1, 
                "experi": 1, 
                "inner": 1, 
                "light": 1, 
                "nervous": 1, 
                "part": 1, 
                "qualiti": 1, 
                "reach": 1, 
                "receptor": 2, 
                "retin": 1, 
                "sensori": 1, 
                "stimul": 2, 
                "system": 2, 
                "upon": 2
            }
        }, 
        ......
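The counting scheme I described above can be sketched like this (a toy vocabulary standing in for the real training data):

```python
from collections import Counter

# Toy illustration: with total vocabulary [a, b, c, d, e], an instance
# containing the words [a, a, d, d, e] becomes the count vector [2, 0, 0, 2, 1].
vocabulary = ["a", "b", "c", "d", "e"]      # all words seen in training
instance_words = ["a", "a", "d", "d", "e"]  # neighbor words of one instance

counts = Counter(instance_words)
vector = [counts[w] for w in vocabulary]
print(vector)  # [2, 0, 0, 2, 1]
```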

Now I just need to provide the input to the K Neighbors Classifier and the Linear SVM from scikit-learn to train the classifiers. But I am not sure how I should build the feature vector and label for each instance. My understanding is that the label should be a tuple of the instance tag and the senseid tag from the "answer". But I am not sure about the feature vector. Should I group all the vectors from the same word that share the same instance tag and senseid tag in the "answer"? There are around 100 words and hundreds of instances for each word; how am I supposed to deal with that?

Also, the count vector is only one feature. I will need to add more features later on, for example synsets, hypernyms, hyponyms, etc. How am I supposed to do that?

Thanks in advance!

Upvotes: 5

Views: 604

Answers (2)

Eugene Lisitsky

Reputation: 12845

Next step: implement a multidimensional linear classifier.

Unfortunately I have no access to this dataset, so this is a little bit theoretical. I can propose this approach:

Coerce all the data into one CSV file like this:

SenseId,Word,Text,IsHyponym,Properties,Attribute1,Attribute2, ...
30821,"BNC","For neurophysiologists and ...","Hyponym sometype",1,1
30822,"BNC","Do you know what it is ...","Antonym type",0,1
...

Next you may use sklearn tools:

import pandas as pd
df = pd.read_csv('file.csv')

from sklearn.feature_extraction import DictVectorizer
enc = DictVectorizer()
X_train_categ = enc.fit_transform(df[['Properties']].to_dict('records'))

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(min_df=5)  # drop terms that appear in fewer than 5 documents - typos and so on
v = vec.fit_transform(df['Text'])

# Join all the data together as one sparse matrix
from scipy.sparse import csr_matrix, hstack
train = hstack((csr_matrix(df.loc[:, 'Word':'Text']), X_train_categ, v))
y = df['SenseId']

# here you have a matrix with really huge dimensionality - tens of thousands of columns;
# you may use Ridge regression to deal with it:
from sklearn.linear_model import Ridge
r = Ridge(random_state=241, alpha=1.0)

# prepare test data like training one

More details on: Ridge, Ridge Classifier.
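A minimal self-contained sketch of that last step, with a `RidgeClassifier` and synthetic counts standing in for the real corpus:

```python
# Toy sketch (synthetic feature counts, not the real corpus): fit a
# RidgeClassifier on a sparse feature matrix and predict a new instance.
from scipy.sparse import csr_matrix
from sklearn.linear_model import RidgeClassifier

X_train = csr_matrix([[2, 0, 0, 2, 1],    # sense 38201
                      [0, 1, 2, 0, 0],    # sense 38202
                      [1, 0, 0, 3, 1]])   # sense 38201
y_train = [38201, 38202, 38201]

clf = RidgeClassifier(alpha=1.0)
clf.fit(X_train, y_train)
# a new instance whose counts resemble the 38201 rows
print(clf.predict(csr_matrix([[2, 0, 0, 1, 1]])))
```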

Other techniques to deal with the high-dimensionality problem.

Code example on text classification using sparse feature matrices.

Upvotes: 2

Eugene Lisitsky
Eugene Lisitsky

Reputation: 12845

Machine learning problems are a kind of optimization task, where you have no predefined best-for-all algorithm, but rather search for the best result using different approaches, parameters and data preprocessing. So you are absolutely right to start from the easiest task - taking only one word and several of its senses.

But I am just not sure how should I build the feature vector and label for each.

You can take just these counts as vector components. Enumerate the vocabulary words and record how many times each word occurs in each text; if a word is absent, put a zero. I've slightly modified your example to clarify the idea:

vocab_38201= {
            "although": 1, 
            "back": 1, 
            "bend": 1, 
            "bicycl": 1, 
            "correct": 1, 
            "dig": 1, 
            "general": 1, 
            "handlebar": 1, 
            "hefti": 1, 
            "lever": 1, 
            "nt": 2, 
            "quit": 1, 
            "rear": 1, 
            "spade": 1, 
            "sprung": 1, 
            "step": 1, 
            "type": 1, 
            "use": 1, 
            "wo": 1
        }

vocab_38202 = {
            "caus": 1, 
            "ear": 1, 
            "energi": 1, 
            "experi": 1, 
            "inner": 1, 
            "light": 1, 
            "nervous": 1, 
            "part": 1, 
            "qualiti": 1, 
            "reach": 1, 
            "receptor": 2, 
            "retin": 1, 
            "sensori": 1, 
            "stimul": 2, 
            "system": 2, 
            "upon": 2,
            "wo": 1     ### added so they have at least one common word
        }

Let's convert these to feature vectors: enumerate all the words and record how many times each word occurs in each vocabulary.

from collections import defaultdict
words = []

def get_components(vect_dict):
    vect_components = defaultdict(int)
    for word, num in vect_dict.items():
        try:
           ind = words.index(word)
        except ValueError:
           ind = len(words)
           words.append(word)
        vect_components[ind] += num
    return vect_components


vect_comps_38201 = get_components(vocab_38201)
vect_comps_38202 = get_components(vocab_38202)

Let's look:

>>> print(vect_comps_38201)
defaultdict(<class 'int'>, {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 1, 17: 1, 18: 1})

>>> print(vect_comps_38202)
defaultdict(<class 'int'>, {32: 1, 33: 2, 34: 1, 7: 1, 19: 2, 20: 2, 21: 1, 22: 1, 23: 1, 24: 1, 25: 1, 26: 1, 27: 2, 28: 1, 29: 1, 30: 1, 31: 1})

>>> vect_38201=[vect_comps_38201.get(i,0) for i in range(len(words))]
>>> print(vect_38201)
[1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

>>> vect_38202=[vect_comps_38202.get(i,0) for i in range(len(words))]
>>> print(vect_38202)
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1]

These vect_38201 and vect_38202 are the vectors you may use to fit a model:

from sklearn.svm import SVC
X = [vect_38201, vect_38202]
y = [38201, 38202]
clf = SVC()
clf.fit(X, y)
clf.predict([[0, 0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 2, 1]])

Output:

array([38202])

Of course this is a very, very simple example, just to show the concept.

What can you do to improve it?

  1. Normalize vector coordinates.

  2. Use the excellent TfidfVectorizer tool to extract features from the text.

  3. Add more data.
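For example, point 2 could look like this (illustrative sentences standing in for the real contexts):

```python
# Sketch: let TfidfVectorizer build normalized feature vectors directly
# from raw context strings instead of hand-enumerated word counts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["step on the lever to activate the spade",
         "receptors in the sensory system are activated by light"]
labels = [38201, 38202]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)   # sparse, L2-normalized tf-idf matrix
clf = SVC().fit(X, labels)
# a new context that shares terms with the second training sentence
print(clf.predict(vec.transform(["stimulation of retinal receptors by light"])))
```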

Good luck!

Upvotes: 2
