Reputation: 343
I'm working on a text classification problem (on a French corpus) and I'm experimenting with different Word Embeddings. I was very interested in what ConceptNet has to offer so I decided to give it a shot.
I wasn't able to find a dedicated tutorial for my particular task, so I took the advice from their blog:
How do I use ConceptNet Numberbatch?
To make it as straightforward as possible:
Work through any tutorial on machine learning for NLP that uses semantic vectors. Get to the part where they tell you to use word2vec. (A particularly enlightened tutorial may tell you to use GloVe 1.2.)
Get the ConceptNet Numberbatch data, and use it instead. Get better results that also generalize to other languages.
Below you may find my approach (note that 'numberbatch.txt' is the file containing the recommended multilingual version: ConceptNet Numberbatch 19.08):
from numpy import asarray

embeddings_index = dict()
with open('numberbatch.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        if len(values) == 2:
            continue  # skip the '<vocab_size> <dim>' header line, if present
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
I started by testing whether a word exists:
word = 'fille'
missingWords = 0
if word not in embeddings_index:
    missingWords += 1
print(missingWords)
I found it surprising that a simple word like 'fille' ('girl' in French) was not found. I then wrote a function to print all the OOV words from my corpus, and was even more surprised by the results: over 22,000 words were missing, including words such as 'nous' (we) and 'être' (to be).
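That check was essentially the following (a minimal sketch; corpus_vocab is a hypothetical stand-in for the set of tokens in my corpus):

def print_oov_words(corpus_vocab, embeddings_index):
    # corpus_vocab: iterable of corpus tokens (placeholder name)
    missing = [w for w in corpus_vocab if w not in embeddings_index]
    print('%d OOV words' % len(missing))
    for w in missing:
        print(w)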
I also tried the approach proposed on the GitHub page for the OOV words, with the same result (a sketch of my implementation follows the quoted strategy below):
Out-of-vocabulary strategy
ConceptNet Numberbatch is evaluated with an out-of-vocabulary strategy that helps its performance in the presence of unfamiliar words. The strategy is implemented in the ConceptNet code base. It can be summarized as follows:
Given an unknown word whose language is not English, try looking up the equivalently-spelled word in the English embeddings (because English words tend to end up in text of all languages).
Given an unknown word, remove a letter from the end, and see if that is a prefix of known words. If so, average the embeddings of those known words.
If the prefix is still unknown, continue removing letters from the end until a known prefix is found. Give up when a single character remains.
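For completeness, this is roughly how I implemented the prefix fallback (a minimal sketch against the embeddings_index built above; the linear scan over all keys is slow but keeps the idea clear):

import numpy as np

def oov_vector(word, embeddings_index):
    # Remove letters from the end until the remainder is a prefix
    # of at least one known word, then average those embeddings.
    prefix = word[:-1]
    while len(prefix) > 1:  # give up once a single character remains
        matches = [vec for key, vec in embeddings_index.items()
                   if key.startswith(prefix)]
        if matches:
            return np.mean(matches, axis=0)
        prefix = prefix[:-1]
    return None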
Am I doing something wrong in my approach?
Upvotes: 3
Views: 1058
Reputation: 46
Are you taking into account ConceptNet Numberbatch's format? As shown in the project's GitHub, it looks like this:
/c/en/absolute_value -0.0847 -0.1316 -0.0800 -0.0708 -0.2514 -0.1687 -...
/c/en/absolute_zero 0.0056 -0.0051 0.0332 -0.1525 -0.0955 -0.0902 0.07...
This format means that fille will not be found, but /c/fr/fille will.
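A minimal sketch of a lookup that builds the prefixed key (reusing the embeddings_index from the question; the lang parameter and the lowercase/underscore normalization are based on ConceptNet's URI conventions):

def get_vector(word, lang='fr'):
    # ConceptNet URIs are lowercase and use underscores for spaces,
    # so normalize before building the /c/<lang>/<word> key.
    key = '/c/%s/%s' % (lang, word.lower().replace(' ', '_'))
    return embeddings_index.get(key)

print(get_vector('fille') is not None)  # should now find /c/fr/fille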
Upvotes: 3