Reputation: 970
I'm building a recurrent neural network with PyBrain for text classification. After numerous attempts I still can't figure out how to convert a list of strings into an array that can be used as a dataset. Here is what I have so far:
import collections, re
from pybrain.datasets import SupervisedDataSet

# Create the supervised dataset with 5 inputs and 1 output
windowSize = 5
main_ds = SupervisedDataSet(windowSize, 1)

with open('ltest5lg_d1.fr', 'r') as train_1:
    import_data_train = train_1.readlines()

train_data = []
for line in import_data_train:
    for word in line.split():
        train_data.append(word)

bagsofwords = [collections.Counter(re.findall(r'\w+', txt)) for txt in train_data]
sumbags = sum(bagsofwords, collections.Counter())
This gives me the frequency table for the training data, but I can't figure out how to convert the data itself into a format that can be fed as input into the main_ds variable.
Upvotes: 1
Views: 191
Reputation: 7742
The standard way of representing words for machine learning is the word-embedding model.
What you want (and this is with only a cursory glance at PyBrain's dataset page [1]) is to build the dataset by converting the text into its vector representations.
For an example of how to do this yourself, see glove-python [2]. If you'd prefer an existing package, see Google's word2vec [3] or Stanford's GloVe [4]; glove-python is a simple Python implementation of the latter.
You could then use this representation to train your NN.
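As a rough sketch of that conversion (my own illustration, not from the answer: random vectors stand in for trained word2vec/GloVe embeddings, and the window/next-word target scheme is one assumed setup), you could concatenate the embeddings of each 5-word window as the input and use the next word's embedding as the target:

```python
import random

random.seed(0)

def build_embeddings(words, dim=3):
    """Assign each distinct word a stand-in dense vector
    (replace with trained word2vec/GloVe vectors in practice)."""
    vocab = sorted(set(words))
    return {w: [random.random() for _ in range(dim)] for w in vocab}

def window_samples(words, embeddings, window_size=5):
    """Yield (input, target) pairs: the concatenated embeddings of
    window_size consecutive words, plus the next word's embedding."""
    for i in range(len(words) - window_size):
        inp = [x for w in words[i:i + window_size] for x in embeddings[w]]
        target = embeddings[words[i + window_size]]
        yield inp, target

words = "the cat sat on the mat the cat slept".split()
emb = build_embeddings(words, dim=3)
samples = list(window_samples(words, emb, window_size=5))

# Each input now has window_size * dim numbers, so in PyBrain you would
# size the dataset accordingly and add the samples:
#   ds = SupervisedDataSet(5 * 3, 3)
#   for inp, target in samples:
#       ds.addSample(inp, target)
print(len(samples), len(samples[0][0]))
```

Note the dataset dimensions change: with dense vectors the input width is window_size * embedding_dim rather than one number per word.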
[1] http://pybrain.org/docs/quickstart/dataset.html
[2] https://github.com/maciejkula/glove-python
[3] https://code.google.com/p/word2vec/
[4] http://www-nlp.stanford.edu/projects/glove/
Upvotes: 1