I am building a classification model for a dataset of items. Basically, I have two columns, e.g.:
| Item name | category |
|---|---|
| unsalted butter | dairy and eggs |
| cheese | dry grocery |
| peanut butter cream | dry grocery |
I did the required preprocessing to clean the item names (my input) and one-hot encoded the category (my target output). I want to use the KNN algorithm to classify the item names, so I have to convert them to numbers.
I am struggling with this conversion step: I am not able to build the right model or check the word2vec results.
Could you please help me with this, since I am a beginner with word-embedding techniques?
I tried the following:
```python
import gensim

def tagged_document(text):
    for i, sent in enumerate(text):
        for j, word in enumerate(sent.split()):
            yield gensim.models.doc2vec.TaggedDocument(word, [j])

data_for_training = list(tagged_document(df['item_name']))
print(data_for_training[3])
```
Output:
```
[TaggedDocument(words='peanut', tags=[0]), TaggedDocument(words='butter', tags=[1]), TaggedDocument(words='cream', tags=[2])]
```
```python
model = gensim.models.doc2vec.Doc2Vec(size=150, window=4, min_count=2, workers=10, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
model.save('model.bin')
print(model)
print(list(model.wv.vocab))
```
Output:
```
Doc2Vec(dm/m,d150,n5,w4,mc2,s0.001,t10)
['u', 'n', 's', 'a', 'l', 't', 'e', 'd', 'b', 'r', 'c', 'm', 'o', 'k', 'x', 'g', 'p', 'i', 'f', 'h', 'y', 'w', 'v', 'z', 'j', 'q', '7', '2', 'ü', '\x95', 'ñ', '1', '±', 'ç', '5', '4', '0', 'ã', 'ä', 'ù', 'ø', '8', '6', '²', '\x8a', 'ª', '\x82', '\x84', 'ð', '\x9f', '¥', '\x96', '§', '3', '\x91', '¯', '¬', '\xad', '¨', 'â', '\x80', '\x99', 'ï', '¿', '½', '\x93', '9', '©', '¢', '\x97', '\x94', '·', '\x88', '\x8d', '\x83', '\x98', '\x90', '®', 'å', 'é', '\x9d', 'æ', '¡', '¹', '´', '\x8c', '°', '¼', '\x87']
```
First and foremost, the `words` part of a `TaggedDocument` should be a list of words. If you provide only a single word (a string), Python will treat it as a list of single-character 'words'.
So when you supply...
`TaggedDocument(tags=[0], words='peanut')`
...that's equivalent to...
`TaggedDocument(tags=[0], words=['p', 'e', 'a', 'n', 'u', 't'])`
That's why your final model has only single-character 'words' in it.
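You can see the difference in a quick sketch (nothing here beyond gensim's own `TaggedDocument`):
```python
from gensim.models.doc2vec import TaggedDocument

# A bare string is an iterable of characters, so training sees letters...
doc = TaggedDocument(words='peanut', tags=[0])
print(list(doc.words))   # ['p', 'e', 'a', 'n', 'u', 't']

# ...while a list of tokens gives Doc2Vec actual words.
doc = TaggedDocument(words=['peanut', 'butter', 'cream'], tags=[0])
print(doc.words)         # ['peanut', 'butter', 'cream']
```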
If in fact you later want to look up `Doc2Vec` document-vectors by the 'Item name' values as keys, you'll want to be sure your code instead creates `TaggedDocument`s more like:
`TaggedDocument(tags=['unsalted butter'], words=['dairy', 'and', 'eggs'])`
On the other hand, if you want to look up vectors by 'category' values as keys, then you'll need the categories to be the tags:
`TaggedDocument(tags=['dairy and eggs'], words=['unsalted', 'butter'])`
Which you choose really depends on what you're trying to achieve: what data is supposed to help you classify into which bins?
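For the classification task you describe, the category-as-tag form looks like the closer fit. Here's a minimal sketch of building such `TaggedDocument`s, assuming your DataFrame has the `item_name` and `category` columns implied by your table:
```python
from gensim.models.doc2vec import TaggedDocument

def tagged_documents(df):
    # One TaggedDocument per row: the item-name tokens as `words`,
    # and the category string as the document's tag (tags may repeat,
    # so every item in a category contributes to that tag's vector).
    # Assumes 'item_name' and 'category' columns, as in the table above.
    for item_name, category in zip(df['item_name'], df['category']):
        yield TaggedDocument(words=item_name.split(), tags=[category])

data_for_training = list(tagged_documents(df))
```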
And it's not clear `Doc2Vec` will be helpful here, given the data you've shown and the task you've described (classification). `Doc2Vec` helps turn texts of many words into shorter summary vectors. It's usually demonstrated on texts at least as long as sentences, but possibly paragraphs, articles, or even full books. With single words, or short phrases of just a few words, it will have a much harder time learning/providing meaningful vectors.
Do you already have a classifier of any type, even a poorly-performing one, working on this same data using simpler techniques, such as the "bag-of-words" representations available through Scikit-Learn classes like `CountVectorizer`?
If not, I suggest doing that first, to achieve actual classification on a simpler and more typical base.
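Such a baseline could be as short as the following sketch, reusing the KNN you mentioned; the column names, split size, and `n_neighbors=5` are illustrative assumptions, not tuned choices:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split raw text and labels; assumes df has 'item_name' and 'category' columns.
X_train, X_test, y_train, y_test = train_test_split(
    df['item_name'], df['category'], test_size=0.2, random_state=42)

# Turn item names into sparse bag-of-words count vectors.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Fit the KNN classifier you mentioned and report held-out accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_bow, y_train)
print(accuracy_score(y_test, knn.predict(X_test_bow)))
```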
Only with that baseline in place should you then consider adding features derived from `Word2Vec` or `Doc2Vec`, to see if they help. Unless you have longer multi-word product descriptions, they might not.
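If you do get to that stage, a hedged sketch of turning a trained `Doc2Vec` model (here assumed to be trained on proper word-lists, per the fix above) into per-item features might look like:
```python
import numpy as np

# Infer one fixed-size vector per item name; these vectors can replace
# the bag-of-words features in the same KNN pipeline above.
X_vec = np.vstack([model.infer_vector(name.split()) for name in df['item_name']])
```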