dcann

Why won't vectors derived from the pre-trained fasttext Japanese wiki model align properly with English vectors?

I'm trying to align English word vectors taken from the word2vec model trained on Google News with Japanese word vectors taken from two different models: the fasttext model pre-trained on Wikipedia, and the fasttext model pre-trained on Common Crawl.

I was able to extract the vectors without issue, all from the .bin files.

All vectors are dimension 300.

Alignment of the vectors is done using Procrustes transformation in Python with the scipy library, like so:

from scipy.linalg import orthogonal_procrustes

# Compute the transformation matrix
R, _ = orthogonal_procrustes(japanese_vectors, english_vectors)

# Apply transformation to Japanese vectors
aligned_japanese_vectors = japanese_vectors @ R  # Matrix multiplication
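
For context, the matrix returned by an orthogonal Procrustes fit is orthogonal, so applying it rotates (or reflects) the rows without changing their lengths; it cannot rescale one embedding space to match the other. Below is a minimal sketch of that property. It uses random stand-in data rather than the real vectors, and a NumPy SVD in place of the scipy call; the helper name `procrustes_rotation` is hypothetical and only for illustration.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal R minimizing ||source @ R - target||_F, via SVD of source.T @ target."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

rng = np.random.default_rng(0)
source = 5.0 * rng.normal(size=(50, 8))  # stand-in for a vector set with large norms
target = rng.normal(size=(50, 8))        # stand-in for a vector set with small norms

R = procrustes_rotation(source, target)
aligned = source @ R

# R is orthogonal, so the transformation leaves every row norm unchanged
is_orthogonal = np.allclose(R.T @ R, np.eye(8))
norms_preserved = np.allclose(
    np.linalg.norm(aligned, axis=1), np.linalg.norm(source, axis=1)
)
```

If both flags come out true on the real data as well, any scale difference between the two vector sets is still present after alignment.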

I don't think the issue is with the code, but with the vectors themselves, specifically those taken from the fasttext wiki model. The vectors simply don't align in the expected way.

The quality of the alignment is measured using cosine similarity, this time in numpy, with the following bit of code:

import numpy as np
from numpy.linalg import norm

# Function to compute cosine similarity
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (norm(v1) * norm(v2))

# Compute cosine similarity for each word pair
cosine_similarities = [
    (english_words[i], japanese_words[i], cosine_similarity(english_vectors[i], aligned_japanese_vectors[i]))
    for i in range(len(english_words))
] 
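
The same per-pair comparison can also be computed for all rows at once. This is a minimal sketch, assuming both inputs are arrays of shape (n_words, dim) with matching row order; the function name `rowwise_cosine` and the toy data are illustrative, not part of the original code.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between corresponding rows of two (n, d) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms

# Toy data standing in for english_vectors / aligned_japanese_vectors
a = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
b = np.array([[2.0, 0.0], [1.0, -1.0], [0.0, 3.0]])
sims = rowwise_cosine(a, b)  # one similarity per row pair
```

Cosine similarity ignores the length of each individual vector, so multiplying either input by a positive constant leaves `sims` unchanged.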

When aligning the English vectors with the Japanese common crawl vectors, the inter-language cosine similarities are ~0.80-0.90, which is what's expected. Similarities between the English vectors and the Japanese vectors from the fasttext wiki model are ~0.4-0.5. Pearson's correlation between the common crawl similarities and the wiki similarities is only ~0.45, which tells me something is way off.

When I inspect the vectors themselves, the English vectors are all <1, as are the Japanese common crawl vectors. The Japanese vectors taken from the wiki model are all >1.
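
A minimal sketch of how the scale of each vector set could be summarized, reporting both the largest absolute component and the row norms so that the comparison is unambiguous. It assumes each set is an array of shape (n_words, dim); the function name `summarize_vectors` and the toy input are hypothetical.

```python
import numpy as np

def summarize_vectors(name, vectors):
    """Return basic scale statistics for an (n_words, dim) array."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1)
    return {
        "name": name,
        "max_abs_component": float(np.abs(vectors).max()),
        "mean_norm": float(norms.mean()),
        "min_norm": float(norms.min()),
        "max_norm": float(norms.max()),
    }

# Toy input: one row with norm 5, one row with norm 1
toy = np.array([[3.0, 4.0], [0.6, 0.8]])
stats = summarize_vectors("toy", toy)
```

Running this on `english_vectors` and on each set of Japanese vectors would put the three scales side by side.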

I compared the vectors from the .bin files to the vectors from the .txt files. English vectors and Japanese common crawl vectors looked more or less the same between the .bin and .txt files. Japanese wiki-model word vectors are dissimilar between the .bin and .txt files.

I'm at a loss. Any help is much appreciated.
