Reputation: 4200
I have a data frame like this:
import pandas as pd
from gensim.corpora import Dictionary
tmp = pd.DataFrame({"word": [1, 0, 0, 0, 0, 0],
                    "house": [0, 1, 0, 0, 0, 0],
                    "tree": [0, 0, 1, 0, 0, 1],  # occurred twice
                    "car": [0, 0, 0, 1, 0, 0],
                    "food": [0, 0, 0, 0, 1, 0],
                    "train": [0, 0, 0, 0, 0, 1]})
# Dictionary is imported directly above, so it is called without the gensim.corpora prefix
mydict = Dictionary()
From this, I want to create a gensim corpus.
I have tried
mycorp = [mydict.doc2bow(col, allow_update=True) for col in tmp.columns]
but this fails with an error instead of creating the corpus:
TypeError: doc2bow expects an array of unicode tokens on input, not a single string
Can someone help me with this? I would like the resulting dictionary to reflect the fact that the word "tree" occurred twice in this data frame (i.e. the sum of the column).
Upvotes: 0
Views: 545
Reputation: 7174
The input to mydict.doc2bow is not correct. It takes a list of strings (the document), not a single string. Iterating over tmp.columns passes one column name, i.e. a single string, on each call, which is what triggers the TypeError.
If you consider each column name to be a document (i.e. document 1 is ["word"]), then you could do:
[mydict.doc2bow([col], allow_update=True) for col in tmp.columns]
# [[(0, 1)], [(1, 1)], [(2, 1)], [(3, 1)], [(4, 1)], [(5, 1)]]
These are six documents (one per sublist), each containing a single word. The tuples in each sublist are (word_id, frequency) pairs. So the first document contains the word with id 0 once, the second document contains the word with id 1 once, and so on.
If you consider your column names to be a single document, then you could do:
mydict.doc2bow(tmp.columns, allow_update=True)
# [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]
Here your corpus consists of a single document, which contains the words with ids 0 to 5 once each.
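Note that neither call above looks at the values in the columns, so "tree" is still counted once. A minimal sketch of one way to bring the column sums in, assuming the whole data frame should count as a single document: repeat each column name as many times as its column total. The sketch holds the same data in a plain dict (the names columns, totals and document are only illustrative) so that it runs without pandas; with the data frame, tmp.sum() should give the same totals.

```python
# Same data as tmp, held in a plain dict so the sketch runs without pandas.
columns = {
    "word": [1, 0, 0, 0, 0, 0],
    "house": [0, 1, 0, 0, 0, 0],
    "tree": [0, 0, 1, 0, 0, 1],  # occurred twice
    "car": [0, 0, 0, 1, 0, 0],
    "food": [0, 0, 0, 0, 1, 0],
    "train": [0, 0, 0, 0, 0, 1],
}

# Column totals, i.e. how often each word occurred overall.
totals = {name: sum(values) for name, values in columns.items()}

# Repeat every token as many times as it occurred.
document = [name for name, total in totals.items() for _ in range(total)]

print(totals["tree"])  # 2
print(document)        # ['word', 'house', 'tree', 'tree', 'car', 'food', 'train']
```

Passing this list to mydict.doc2bow(document, allow_update=True) should then report a frequency of 2 for the id of "tree".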
Instead of working with strings ("tokens") directly, like "word", "house", etc., gensim uses integers that represent a string. These integers are word ids. To see which id corresponds to a given word, you can use:
mydict.token2id['word']
# 0
The bag of words is represented as a list of (word_id, frequency) tuples, because any given word may occur multiple times in a document. Especially in longer documents, a single word may appear 100 times. Instead of saving a reference to that word 100 times, gensim saves (word_id, 100). This represents that the word occurs 100 times in the document.
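To make the (word_id, frequency) idea concrete, here is a small pure-Python sketch of the same bookkeeping. The function toy_doc2bow is a hypothetical stand-in written for illustration, not gensim's actual implementation; it only mimics the idea of counting tokens and replacing them with integer ids.

```python
from collections import Counter


def toy_doc2bow(document, token2id):
    """Hypothetical stand-in for doc2bow: count tokens, then map them to ids."""
    counts = Counter(document)
    # Give every unseen token the next free integer id.
    for token in sorted(counts):
        token2id.setdefault(token, len(token2id))
    # One (word_id, frequency) tuple per distinct token, ordered by id.
    return sorted((token2id[token], n) for token, n in counts.items())


token2id = {}
bow = toy_doc2bow(["tree", "house", "tree"], token2id)
print(token2id)  # {'house': 0, 'tree': 1}
print(bow)       # [(0, 1), (1, 2)]
```

The document has three tokens, but the bag of words stores only two tuples, with "tree" recorded as (1, 2).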
Upvotes: 1