Reputation: 970
I'm building a recurrent neural network with PyBrain for text classification. After numerous attempts I still can't figure out how to convert a list of strings into an array that can be used as a dataset. Here is what I have so far:
import collections, re
from pybrain.datasets import SupervisedDataSet

# Create the supervised dataset with 5 inputs and 1 output
windowSize = 5
main_ds = SupervisedDataSet(windowSize, 1)

with open('ltest5lg_d1.fr', 'r') as train_1:
    import_data_train = train_1.readlines()

train_data = []
for line in import_data_train:
    for word in line.split():
        train_data.append(word)

bagsofwords = [collections.Counter(re.findall(r'\w+', txt)) for txt in train_data]
sumbags = sum(bagsofwords, collections.Counter())
This gives me the frequency table for the training data, but I can't figure out how to convert the data itself into a format that can be fed as input into the main_ds variable.
Upvotes: 1
Views: 191
Reputation: 7742
The standard way of representing words for machine learning is the word-embedding model.
What you want (and this is with only a cursory glance at PyBrain's dataset page [1]) is to build the dataset by converting the text into its vector representations.
For an example of how to do this yourself, see glove-python [2]. If you'd prefer an existing package, see Google's word2vec [3] or Stanford's GloVe [4]; glove-python is a simple Python implementation of the latter.
You could then use this representation to train your NN.
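As a rough sketch of that conversion (my own illustration, not from the answer: random vectors stand in for trained word2vec/GloVe embeddings, and the window/next-word target scheme is one assumed setup), you could concatenate the embeddings of each 5-word window as the input and use the next word's embedding as the target:

```python
import random

random.seed(0)

def build_embeddings(words, dim=3):
    """Assign each distinct word a stand-in dense vector
    (replace with trained word2vec/GloVe vectors in practice)."""
    vocab = sorted(set(words))
    return {w: [random.random() for _ in range(dim)] for w in vocab}

def window_samples(words, embeddings, window_size=5):
    """Yield (input, target) pairs: the concatenated embeddings of
    window_size consecutive words, plus the next word's embedding."""
    for i in range(len(words) - window_size):
        inp = [x for w in words[i:i + window_size] for x in embeddings[w]]
        target = embeddings[words[i + window_size]]
        yield inp, target

words = "the cat sat on the mat the cat slept".split()
emb = build_embeddings(words, dim=3)
samples = list(window_samples(words, emb, window_size=5))

# Each input now has window_size * dim numbers, so in PyBrain you would
# size the dataset accordingly and add the samples:
#   ds = SupervisedDataSet(5 * 3, 3)
#   for inp, target in samples:
#       ds.addSample(inp, target)
print(len(samples), len(samples[0][0]))
```

Note the dataset dimensions change: with dense vectors the input width is window_size * embedding_dim rather than one number per word.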
[1] http://pybrain.org/docs/quickstart/dataset.html
[2] https://github.com/maciejkula/glove-python
[3] https://code.google.com/p/word2vec/
[4] http://www-nlp.stanford.edu/projects/glove/
Upvotes: 1