Reputation: 890
I am trying to train a Naive Bayes classifier and I am having trouble with the data. I plan to use it for extractive text summarization.
Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.
I have a dataset that I plan to use, and every document in it contains at least one sentence that belongs to the summary.
I decided to use sklearn, but I don't know how to represent the data that I have, namely X and y.
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)
The closest I can think of is to make it like this:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
y = [
[0,1],
[1,0]
]
where the target values mean 1 - included in the summary and 0 - not included. Unfortunately this raises a bad-shape exception because y is expected to be a 1-d array. I cannot think of a way to represent it as such, so please help.
Btw, I don't use the string values in X directly but represent them as vectors with CountVectorizer and TfidfTransformer from sklearn.
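For reference, here is a minimal sketch of my current attempt that reproduces the error (the variable names and vectorizer settings are just placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# one string per document
X_raw = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.'
]
# one list of per-sentence labels per document
y = [
    [0, 1],
    [1, 0]
]

# turn the raw strings into tf-idf feature vectors
counts = CountVectorizer().fit_transform(X_raw)
features = TfidfTransformer().fit_transform(counts)

# raises the bad-shape error: y is expected to be a 1-d array
clf = MultinomialNB().fit(features, y)
```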
Upvotes: 0
Views: 511
Reputation: 1138
As per your requirement, you are classifying the data, which means you need to separate each sentence in order to predict its class.
For example, instead of using:
X = [
'It was a sunny day. The weather was nice and the birds were singing.',
'I like trains. Hi, again.'
]
use it as follows:
X = [
'It was a sunny day.',
'The weather was nice and the birds were singing.',
'I like trains.',
'Hi, again.'
]
Use NLTK's sentence tokenizer (sent_tokenize) to achieve this.
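For example, a minimal sketch (assuming NLTK is installed; you may need to download the punkt tokenizer data first):

```python
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

documents = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.'
]

# split every document into its sentences and flatten into one list
X = [sentence for doc in documents for sentence in sent_tokenize(doc)]
print(X)
# ['It was a sunny day.', 'The weather was nice and the birds were singing.',
#  'I like trains.', 'Hi, again.']
```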
Now, for labels, use two classes, say 1 for "included in the summary" and 0 for "not included", as a flat 1-d array with one label per sentence:
y = [0, 1, 1, 0]
Now, use this data to fit and predict the way you want!
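For example, a rough end-to-end sketch with CountVectorizer and TfidfTransformer as you mentioned (the sentences, labels, and new document here are just placeholders):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# one entry per sentence
X = [
    'It was a sunny day.',
    'The weather was nice and the birds were singing.',
    'I like trains.',
    'Hi, again.'
]
# one 0/1 label per sentence, as a flat 1-d list
y = [0, 1, 1, 0]

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()

# vectorize the sentences and fit the classifier
counts = vectorizer.fit_transform(X)
features = tfidf.fit_transform(counts)
clf = MultinomialNB().fit(features, y)

# predict on the sentences of a new document (already sentence-tokenized);
# sentences predicted as 1 would go into the summary
new_sentences = ['The sun was shining.', 'I took the train home.']
new_features = tfidf.transform(vectorizer.transform(new_sentences))
print(clf.predict(new_features))  # one 0/1 label per sentence
```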
Hope it helps!
Upvotes: 1