Cross-Lingual Word Sense Disambiguation

Question

I am a beginner in computer programming and I am writing an essay on parallel corpora in word sense disambiguation. Basically, I intend to show that using a word's translation in place of a sense label simplifies the process of identifying the meaning of ambiguous words. I have already word-aligned my parallel corpus (EUROPARL English-Spanish) with GIZA++, but I don't know what to do with the output files. My intention is to build a classifier that estimates the probability of a translation word given contextual features of the tokens surrounding the ambiguous word in the source text. So, my question is: how do you extract instances of an ambiguous word from a parallel corpus together with its aligned translation?

I have tried various scripts in Python, but they assume that 1) the English and Spanish texts are in separate corpora and 2) the English and Spanish sentences share the same indexes, which obviously does not work. For example:

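from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

w_lemma = WordNetLemmatizer()

def ambigu_word2(document, document2):
    # Assumes document[i] and document2[i] are translations of each other.
    words = {'letter'}
    for sentence, translation in zip(document, document2):
        # Lemmatize every token so that 'letters' also matches 'letter'.
        lemmas = {w_lemma.lemmatize(t.lower()) for t in word_tokenize(sentence)}
        if words & lemmas:
            print(sentence, translation)

# raw1 and raw2 are the English and Spanish sentence lists.
ambigu_word2(raw1, raw2)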

I would be really grateful if you could provide any guidance on this matter.
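
One possible starting point, offered as a minimal sketch rather than a definitive solution: GIZA++ normally writes its word alignments to a file whose name ends in .A3.final, with three lines per sentence pair (a "# Sentence pair" header, one sentence as plain tokens, and the other sentence with each token followed by the 1-based positions it is aligned to, e.g. `letter ({ 3 })`). The code below assumes that layout and assumes English is the annotated side, which depends on the direction GIZA++ was trained in. The names `read_a3`, `extract_instances`, the `forms` set and the two sample sentence pairs are illustrative inventions, not part of GIZA++ or of the corpus, and matching on a set of surface forms is a simplification that stands in for lemmatization.

```python
import re

# One annotated token looks like:  word ({ 3 4 })
TOKEN_RE = re.compile(r"(\S+) \(\{([^}]*)\}\)")

def read_a3(lines):
    """Yield (tokens, other_tokens, links) for each sentence pair.

    links[i] holds the 0-based positions in other_tokens that are
    aligned to tokens[i]. Assumes the three-line A3 layout.
    """
    it = iter(lines)
    for header in it:
        if not header.startswith("# Sentence pair"):
            continue
        other = next(it, "").split()
        annotated = TOKEN_RE.findall(next(it, ""))
        # The first entry is the NULL pseudo-token; skip it.
        if annotated and annotated[0][0] == "NULL":
            annotated = annotated[1:]
        tokens = [word for word, _ in annotated]
        links = [
            [int(p) - 1 for p in positions.split() if 0 < int(p) <= len(other)]
            for _, positions in annotated
        ]
        yield tokens, other, links

def extract_instances(lines, forms, window=3):
    """Yield (translation, context_tokens) for every aligned occurrence
    of a word whose lowercased form is in `forms`."""
    for tokens, other, links in read_a3(lines):
        for i, tok in enumerate(tokens):
            if tok.lower() in forms and links[i]:
                translation = " ".join(other[j] for j in links[i])
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                yield translation, context

# Made-up sample in the assumed format (not taken from EUROPARL).
sample = """\
# Sentence pair (1) source length 6 target length 5 alignment score : 1.2e-07
recibí una carta de ella
NULL ({ }) i ({ }) received ({ 1 }) a ({ 2 }) letter ({ 3 }) from ({ 4 }) her ({ 5 })
# Sentence pair (2) source length 5 target length 4 alignment score : 3.4e-09
falta la primera letra
NULL ({ }) the ({ 2 }) first ({ 3 }) letter ({ 4 }) is ({ }) missing ({ 1 })
"""

instances = list(extract_instances(sample.splitlines(), {"letter", "letters"}, window=2))
for translation, context in instances:
    print(translation, context)
```

For a real alignment file, the same function could be fed an open file handle instead of `sample.splitlines()`. Each yielded pair gives the translation as the class label and the surrounding tokens as raw contextual features, which is the shape of data a translation-probability classifier would need.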

