Reputation: 103
Below is sample code for the IMDB dataset. I am a beginner following a tutorial, and I am trying to load my own dataset in Keras. How would I modify the code? I would be very grateful.
import keras
from keras.datasets import imdb
from keras.preprocessing import sequence

# Using keras to load the dataset with the top_words
max_features = 10000  # max number of words to include; words are ranked by how often they occur (in the training set)
max_review_length = 1600

(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=max_features)
print('loaded dataset...')

# Pad the sequences to the same length
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

index_dict = keras.datasets.imdb.get_word_index()
Upvotes: 1
Views: 2766
Reputation: 393
Here's a simple solution with Pandas and CountVectorizer. You'll then need to pad the data and split it into test and train sets as above.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

data = {
    'label': [0, 1, 0, 1],
    'text': ['first bit of text', 'second bit of text', 'third text', 'text number four']
}
data = pd.DataFrame.from_dict(data)

# Form vocab dictionary
vectorizer = CountVectorizer()
vectorizer.fit_transform(data['text'].tolist())
vocab_text = vectorizer.vocabulary_

# Convert each text to a list of (1-based) vocabulary indices
def convert_text(text):
    text_list = text.split(' ')
    return [vocab_text[t] + 1 for t in text_list]

data['text'] = data['text'].apply(convert_text)

# Get X and y matrices
y = np.array(data['label'])
X = np.array(data['text'])
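The snippet above stops at `X` and `y` and leaves padding and splitting to the reader. A minimal sketch of those two steps, using hypothetical toy sequences in place of the `X`/`y` built above, and a hand-rolled left-pad that mimics the default behaviour of `keras.preprocessing.sequence.pad_sequences` (zeros prepended, sequences truncated from the front):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the X and y produced above.
X = [[1, 2, 3], [4, 2, 3], [5, 3], [3, 6, 7]]
y = np.array([0, 1, 0, 1])

# Left-pad every sequence with zeros to a fixed length.
max_review_length = 5
X_padded = np.zeros((len(X), max_review_length), dtype=int)
for i, seq in enumerate(X):
    X_padded[i, -len(seq):] = seq[-max_review_length:]

# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_padded, y, test_size=0.25, random_state=42)
```

In real use you would pass your actual index lists and set `max_review_length` to match whatever the model expects (1600 in the question), or simply call `sequence.pad_sequences(X, maxlen=max_review_length)` as in the question's code.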
Upvotes: 1