Ivo

Reputation: 4200

gensim corpus from sparse matrix

I have a data frame like this:

import pandas as pd
from gensim.corpora import Dictionary

tmp = pd.DataFrame({"word":  [1, 0, 0, 0, 0, 0],
                    "house": [0, 1, 0, 0, 0, 0],
                    "tree":  [0, 0, 1, 0, 0, 1], # occurred twice
                    "car":   [0, 0, 0, 1, 0, 0],
                    "food":  [0, 0, 0, 0, 1, 0],
                    "train": [0, 0, 0, 0, 0, 1]})
mydict = Dictionary()  # only Dictionary is imported, not the gensim package itself

From this, I want to create a gensim corpus.

I have tried mycorp = [mydict.doc2bow(col, allow_update=True) for col in tmp.columns], but instead of creating the corpus it raises an error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

Can someone help me with this? I would like the resulting dictionary to represent the fact that the word "tree" occurred twice in this data frame (i.e. the sum of its column).

Upvotes: 0

Views: 545

Answers (1)

KenHBS

Reputation: 7174

The input to mydict.doc2bow is not correct. It expects a document in the form of a list of strings (tokens), not a single string. Iterating over tmp.columns passes each column name as a bare string, which is what triggers the TypeError.

Scenario 1

If you consider each column name to be a document (i.e. document 1 is ["word"]), then you could do:

[mydict.doc2bow([col], allow_update=True) for col in tmp.columns]
# [[(0, 1)], [(1, 1)], [(2, 1)], [(3, 1)], [(4, 1)], [(5, 1)]]

This yields six documents (one per sublist), each containing a single word. Each tuple is a (word_id, frequency) pair, so the first document contains word 0 once, the second document contains word 1 once, and so on.

Scenario 2

If you consider your column names to be a single document, then you could do:

mydict.doc2bow(tmp.columns, allow_update=True) 
# [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]

Here your corpus consists of a single document, which contains word 0 through word 5 once each.

Little bit of background

Instead of working with strings ("tokens") directly, like "word", "house", etc., gensim uses integers that represent the strings. These integers are word ids. To see which id corresponds to a given word, you can use:

mydict.token2id['word']
# 0

The bag of words is represented as a list of (word_id, frequency) tuples, because any given word may occur multiple times in a document. Especially in longer documents, a single word may appear 100 times.

Instead of saving a reference to that word 100 times, gensim stores (word_id, 100). This represents that the word occurs 100 times in the document.
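To make the encoding concrete, here is a small plain-Python sketch of the same idea. It only imitates the (word_id, frequency) format and is not gensim's implementation; the token list and the first-seen id assignment are assumptions made for illustration, and gensim may assign ids in a different order:

```python
from collections import Counter

# A toy document in which "tree" appears three times.
tokens = ["tree", "house", "tree", "car", "tree"]

# Assign each distinct token an integer id in order of first appearance.
token2id = {}
for tok in tokens:
    token2id.setdefault(tok, len(token2id))

# Count occurrences per id and store one (word_id, frequency) pair per word.
counts = Counter(token2id[tok] for tok in tokens)
bow = sorted(counts.items())

print(token2id)  # {'tree': 0, 'house': 1, 'car': 2}
print(bow)       # [(0, 3), (1, 1), (2, 1)]
```

Five tokens collapse into three pairs, and the saving grows with the number of repeated words in a document.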

Upvotes: 1
