Reputation: 773
I'm using the lda library in Python and running the example below. Does anyone know the format of X, vocab and titles? I can't find the documentation.
import numpy as np
import lda
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()
Upvotes: 3
Views: 1030
Reputation: 827
X is a matrix whose rows correspond to the entries of titles (one document per row) and whose columns correspond to the words in vocab. It is a bag-of-words representation of each title's text.
X
Out[8]:
array([[1, 0, 1, ..., 0, 0, 0],
       [7, 0, 2, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0]], dtype=int32)
In the matrix above, each row is the bag-of-words representation of one title. Each column represents a specific word, for example:
vocab[:5]
Out[5]: ('church', 'pope', 'years', 'people', 'mother')
So the entry at row i, column j of X gives the number of times the word vocab[j] occurs in the i-th title, titles[i].
titles[:1]
Out[11]: ('0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20',)
The title UK: Prince Charles ... mentions the word church once, pope 0 times, years once, and so on.
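To make the row/column reading concrete, here is a minimal sketch that turns one row of counts into a word-to-count mapping. The helper name word_counts and the toy_* values are made up for illustration (they are not part of lda and not the real Reuters counts); the same function should work on a row such as X[0] together with the real vocab.

```python
def word_counts(row, vocab):
    """Map each word with a nonzero count in `row` to that count.

    `row` is one row of a document-term matrix (a list or a 1-D NumPy array);
    `vocab[j]` is the word that column j stands for.
    """
    return {vocab[j]: int(count) for j, count in enumerate(row) if count > 0}


# Toy stand-ins with made-up counts (NOT the real Reuters data).
toy_vocab = ("church", "pope", "years", "people", "mother")
toy_titles = ("doc A", "doc B")
toy_X = [
    [2, 0, 1, 0, 0],  # counts for "doc A"
    [0, 3, 0, 1, 1],  # counts for "doc B"
]

# Row i of the matrix belongs to titles[i]; column j belongs to vocab[j].
for i, title in enumerate(toy_titles):
    print(title, word_counts(toy_X[i], toy_vocab))
```

With the real data, word_counts(X[0], vocab) would list every word that occurs in the first title along with its frequency.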
In [13]: type(titles)
Out[13]: tuple
In [14]: type(vocab)
Out[14]: tuple
In [15]: type(X)
Out[15]: numpy.ndarray
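Since the three objects are parallel, their sizes should line up: one row of X per entry in titles and one column per entry in vocab. A small sanity check along those lines might look like the sketch below; check_alignment and the toy_* values are hypothetical names and made-up data, used here only so the snippet runs without the lda package.

```python
def check_alignment(X, vocab, titles):
    """Return True if X has one row per title and one column per vocab word.

    Works for a 2-D NumPy array as well as a list of lists, because
    len(X) is the number of rows and len(X[0]) the number of columns.
    """
    n_rows = len(X)
    n_cols = len(X[0]) if n_rows else 0
    return n_rows == len(titles) and n_cols == len(vocab)


# Toy stand-ins with made-up counts (NOT the real Reuters data).
toy_vocab = ("church", "pope", "years", "people", "mother")
toy_titles = ("doc A", "doc B")
toy_X = [
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 1],
]

print(check_alignment(toy_X, toy_vocab, toy_titles))
```

Calling check_alignment(X, vocab, titles) on the objects loaded above is a quick way to confirm this reading of the format on your own installation.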
Upvotes: 7