Mayar Alzerki

Reputation: 27

How to work with Doc2Vec, and which approach is better: training the model on my dataset or using a pretrained model?

I am building a classification model for a dataset of items. Basically, I have two columns, for example:

Item name            category
unsalted butter      dairy and eggs
cheese               dry grocery
peanut butter cream  dry grocery

I did the required preprocessing to clean the item name, which is my input, and one-hot encoded the category, which is my target output. I want to use the KNN algorithm to classify the item names, so I have to convert them to numbers.

I am struggling with the conversion step: I am not able to build the right model or check the word2vec accuracy results.

Would you please help me with this, since I am a beginner with word embedding techniques?

I tried the following:

import gensim

def tagged_document(text):
    for i, sent in enumerate(text):
        for j, word in enumerate(sent.split()):
            yield gensim.models.doc2vec.TaggedDocument(word, [j])

data_for_training = list(tagged_document(df['item_name']))
print(data_for_training[3])

Output: [TaggedDocument(words='peanut', tags=[0]), TaggedDocument(words='butter', tags=[1]), TaggedDocument(words='cream', tags=[2])]

model = gensim.models.doc2vec.Doc2Vec(size=150, window=4, min_count=2, workers=10, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
model.save("model.bin")

print(model)
print(list(model.wv.vocab))

Output:

Doc2Vec(dm/m,d150,n5,w4,mc2,s0.001,t10) ['u', 'n', 's', 'a', 'l', 't', 'e', 'd', 'b', 'r', 'c', 'm', 'o', 'k', 'x', 'g', 'p', 'i', 'f', 'h', 'y', 'w', 'v', 'z', 'j', 'q', '7', '2', 'ü', '\x95', 'ñ', '1', '±', 'ç', '5', '4', '0', 'ã', 'ä', 'ù', 'ø', '8', '6', '²', '\x8a', 'ª', '\x82', '\x84', 'ð', '\x9f', '¥', '\x96', '§', '3', '\x91', '¯', '¬', '\xad', '¨', 'â', '\x80', '\x99', 'ï', '¿', '½', '\x93', '9', '©', '¢', '\x97', '\x94', '·', '\x88', '\x8d', '\x83', '\x98', '\x90', '®', 'å', 'é', '\x9d', 'æ', '¡', '¹', '´', '\x8c', '°', '¼', '\x87']

Upvotes: 0

Views: 441

Answers (1)

gojomo

Reputation: 54233

First and foremost, the words part of a TaggedDocument should be a list of words. If you provide only a single word (a plain string), Python will treat it as a sequence of single-character 'words'.

So when you supply...

TaggedDocument(tags=[0], words='peanut')

...that's equivalent to...

TaggedDocument(tags=[0], words=['p', 'e', 'a', 'n', 'u', 't'])

That's why your final model has only single-character 'words' in it.
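
You can see this directly in Python, since iterating over a string yields its individual characters:

list('peanut')
# ['p', 'e', 'a', 'n', 'u', 't']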

If you later want to look up Doc2Vec document-vectors by the 'Item name' values, you'll want to be sure your code instead creates TaggedDocuments more like:

TaggedDocument(tags=['unsalted butter'], words=['dairy', 'and', 'eggs'])

On the other hand, if you want to look up vectors by 'category' values, then you'll need the categories to be the tags:

TaggedDocument(tags=['dairy and eggs'], words=['unsalted', 'butter'])

Which of these you choose really depends on what you're trying to achieve: what data is supposed to help you classify into which bins?
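
For instance, here's a minimal sketch of the category-as-tag option, rewriting your generator; it assumes the dataframe's columns are named item_name and category (the second name is a guess, since only item_name appears in your code):

import gensim

def tagged_documents(df):
    # One TaggedDocument per row: the item name split into word tokens,
    # with the category string as the document's tag. Rows sharing a
    # category share a tag, so Doc2Vec learns one vector per category.
    for _, row in df.iterrows():
        yield gensim.models.doc2vec.TaggedDocument(
            words=row['item_name'].split(),
            tags=[row['category']])

data_for_training = list(tagged_documents(df))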

And it's not clear that Doc2Vec will be helpful here, given the data you've shown and the task you've described (classification).

Doc2Vec helps turn texts of many words into shorter summary vectors. It's usually demonstrated on texts that are at least as long as sentences, but possibly paragraphs, articles, or even full books. With single words, or short phrases of just a few words, it will have a much harder time learning/providing meaningful vectors.

Do you already have a classifier of any type, even a poorly-performing one, working on this same data using simpler techniques, such as the "bag-of-words" representations available through Scikit-Learn classes like CountVectorizer?

If not, I suggest doing that first, to get actual classification working on a simpler and more typical basis.
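
As a concrete starting point, here's a rough baseline sketch along those lines; the column names, split, and n_neighbors value are all just illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hold out a test split, then learn a bag-of-words vocabulary from the
# training texts only.
X_train, X_test, y_train, y_test = train_test_split(
    df['item_name'], df['category'], test_size=0.2, random_state=42)

vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# KNN on the sparse word-count vectors gives a simple reference score.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_bow, y_train)
print(accuracy_score(y_test, knn.predict(X_test_bow)))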

Only with that baseline in place should you consider adding features derived from Word2Vec or Doc2Vec, to see if they help. Unless you have longer multi-word product descriptions, they might not.
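
If you do get that far, one way to try Doc2Vec features is to infer one vector per item name and feed those to the same classifier. This sketch assumes the train/test split from the baseline above and a Doc2Vec model trained on properly tokenized TaggedDocuments:

import numpy as np

# infer_vector() maps a list of word tokens to a fixed-length vector,
# so each item name becomes one dense feature row.
X_train_vec = np.vstack([model.infer_vector(name.split()) for name in X_train])
X_test_vec = np.vstack([model.infer_vector(name.split()) for name in X_test])

knn_vec = KNeighborsClassifier(n_neighbors=5)
knn_vec.fit(X_train_vec, y_train)
print(accuracy_score(y_test, knn_vec.predict(X_test_vec)))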

Upvotes: 0
