I am building a classification model for a dataset of items. Basically, I have two columns, e.g.:
| Item name | category |
|---|---|
| unsalted butter | dairy and eggs |
| cheese | dry grocery |
| peanut butter cream | dry grocery |
I did the required preprocessing to clean the item names (my input) and one-hot encoded the category (my target output). I want to use the KNN algorithm to classify the item names, so I have to convert them to numbers.
I am struggling with this conversion step: I am not able to build the right model or check the word2vec results.
Could you please help me with this, since I am a beginner with word-embedding techniques?
I tried the following:
```python
import gensim

def tagged_document(text):
    for i, sent in enumerate(text):
        for j, word in enumerate(sent.split()):
            yield gensim.models.doc2vec.TaggedDocument(word, [j])

data_for_training = list(tagged_document(df['item_name']))
print(data_for_training[3])
```
Output:
```
[TaggedDocument(words='peanut', tags=[0]), TaggedDocument(words='butter', tags=[1]), TaggedDocument(words='cream', tags=[2])]
```
```python
model = gensim.models.doc2vec.Doc2Vec(size=150, window=4, min_count=2, workers=10, epochs=30)
model.build_vocab(data_for_training)
model.train(data_for_training, total_examples=model.corpus_count, epochs=model.epochs)
model.save('model.bin')
print(model)
print(list(model.wv.vocab))
```
Output:
```
Doc2Vec(dm/m,d150,n5,w4,mc2,s0.001,t10)
['u', 'n', 's', 'a', 'l', 't', 'e', 'd', 'b', 'r', 'c', 'm', 'o', 'k', 'x', 'g', 'p', 'i', 'f', 'h', 'y', 'w', 'v', 'z', 'j', 'q', '7', '2', 'ü', '\x95', 'ñ', '1', '±', 'ç', '5', '4', '0', 'ã', 'ä', 'ù', 'ø', '8', '6', '²', '\x8a', 'ª', '\x82', '\x84', 'ð', '\x9f', '¥', '\x96', '§', '3', '\x91', '¯', '¬', '\xad', '¨', 'â', '\x80', '\x99', 'ï', '¿', '½', '\x93', '9', '©', '¢', '\x97', '\x94', '·', '\x88', '\x8d', '\x83', '\x98', '\x90', '®', 'å', 'é', '\x9d', 'æ', '¡', '¹', '´', '\x8c', '°', '¼', '\x87']
```
First and foremost, the `words` part of a `TaggedDocument` should be a list of words. If you provide only a single word (a string), Python will treat it as a list of single-character 'words'.
So when you supply...
`TaggedDocument(tags=[0], words='peanut')`
...that's equivalent to...
`TaggedDocument(tags=[0], words=['p', 'e', 'a', 'n', 'u', 't'])`
That's why your final model has only single-character 'words' in it.
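You can see the difference in a quick sketch (nothing here beyond gensim's own `TaggedDocument`):
```python
from gensim.models.doc2vec import TaggedDocument

# A bare string is an iterable of characters, so training sees letters...
doc = TaggedDocument(words='peanut', tags=[0])
print(list(doc.words))   # ['p', 'e', 'a', 'n', 'u', 't']

# ...while a list of tokens gives Doc2Vec actual words.
doc = TaggedDocument(words=['peanut', 'butter', 'cream'], tags=[0])
print(doc.words)         # ['peanut', 'butter', 'cream']
```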
If in fact you later want to look up `Doc2Vec` document-vectors by the 'Item name' values as keys, you'll want to be sure your code instead creates `TaggedDocument`s more like:
`TaggedDocument(tags=['unsalted butter'], words=['dairy', 'and', 'eggs'])`
On the other hand, if you want to look up vectors by 'category' values as keys, then you'll need the categories to be the tags:
`TaggedDocument(tags=['dairy and eggs'], words=['unsalted', 'butter'])`
Which you choose really depends on what you're trying to achieve: what data is supposed to help you classify into which bins?
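For the classification task you describe, the category-as-tag form looks like the closer fit. Here's a minimal sketch of building such `TaggedDocument`s, assuming your DataFrame has the `item_name` and `category` columns implied by your table:
```python
from gensim.models.doc2vec import TaggedDocument

def tagged_documents(df):
    # One TaggedDocument per row: the item-name tokens as `words`,
    # and the category string as the document's tag (tags may repeat,
    # so every item in a category contributes to that tag's vector).
    # Assumes 'item_name' and 'category' columns, as in the table above.
    for item_name, category in zip(df['item_name'], df['category']):
        yield TaggedDocument(words=item_name.split(), tags=[category])

data_for_training = list(tagged_documents(df))
```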
And it's not clear `Doc2Vec` will be helpful here, given the data you've shown and the task you've described (classification). `Doc2Vec` helps turn texts of many words into shorter summary vectors. It's usually demonstrated on texts at least as long as sentences, but possibly paragraphs, articles, or even full books. With single words, or short phrases of just a few words, it will have a much harder time learning/providing meaningful vectors.
Do you already have a classifier of any type, even a poorly-performing one, working on this same data using simpler techniques, such as the "bag-of-words" representations available through Scikit-Learn classes like `CountVectorizer`?
If not, I suggest doing that first, to achieve actual classification on a simpler and more typical base.
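Such a baseline could be as short as the following sketch, reusing the KNN you mentioned; the column names, split size, and `n_neighbors=5` are illustrative assumptions, not tuned choices:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split raw text and labels; assumes df has 'item_name' and 'category' columns.
X_train, X_test, y_train, y_test = train_test_split(
    df['item_name'], df['category'], test_size=0.2, random_state=42)

# Turn item names into sparse bag-of-words count vectors.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Fit the KNN classifier you mentioned and report held-out accuracy.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_bow, y_train)
print(accuracy_score(y_test, knn.predict(X_test_bow)))
```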
Only with that baseline in place should you then consider adding features derived from `Word2Vec` or `Doc2Vec`, to see if they help. Unless you have longer multi-word product descriptions, they might not.
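If you do get to that stage, a hedged sketch of turning a trained `Doc2Vec` model (here assumed to be trained on proper word-lists, per the fix above) into per-item features might look like:
```python
import numpy as np

# Infer one fixed-size vector per item name; these vectors can replace
# the bag-of-words features in the same KNN pipeline above.
X_vec = np.vstack([model.infer_vector(name.split()) for name in df['item_name']])
```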