Nikola

Reputation: 890

Naive Bayes classifier extractive summary

I am trying to train a Naive Bayes classifier and I am having trouble representing the data. I plan to use it for extractive text summarization.

Example_Input: It was a sunny day. The weather was nice and the birds were singing.
Example_Output: The weather was nice and the birds were singing.

I have a dataset I plan to use, and every document in it contains at least one sentence that belongs in the summary.

I decided to use sklearn, but I don't know how to represent the data that I have, namely X and y.

from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X, y)

The closest I have come is something like this:

X = [
        'It was a sunny day. The weather was nice and the birds were singing.',
        'I like trains. Hi, again.'
    ]

y = [
        [0,1],
        [1,0]
    ]

where the target values mean 1 - included in the summary and 0 - not included. Unfortunately, this raises a bad-shape exception because y is expected to be a 1-d array. I cannot think of a way to represent it as such, so please help.

By the way, I don't use the string values in X directly; I represent them as vectors with CountVectorizer and TfidfTransformer from sklearn.
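For reference, the vectorization step I use looks roughly like this (a minimal sketch; the document strings are the example ones from above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

X_text = [
    'It was a sunny day. The weather was nice and the birds were singing.',
    'I like trains. Hi, again.',
]

# Bag-of-words counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(X_text)
X = TfidfTransformer().fit_transform(counts)

print(X.shape[0])  # 2 rows, one per document
```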

Upvotes: 0

Views: 511

Answers (1)

abhinav

Reputation: 1138

As per your requirement, you are classifying sentences. That means you need to separate each sentence so you can predict its class.

For example, instead of using:

X = [
        'It was a sunny day. The weather was nice and the birds were singing.',
        'I like trains. Hi, again.'
    ]

use it as follows:

X = [
        'It was a sunny day.',
        'The weather was nice and the birds were singing.',
        'I like trains.',
        'Hi, again.'
    ]

Use NLTK's sentence tokenizer to achieve this.

Now, for labels, use two classes: say, 1 for yes (in the summary) and 0 for no. Keep y as a flat 1-d list, which is the shape sklearn expects:

y = [0, 1, 1, 0]

Now, fit and predict on this data the way you want!
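Putting the pieces together, a minimal end-to-end sketch could look like this (the vectorizers are the ones you already mentioned, chained in a pipeline for convenience):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# One sample per sentence, flat 1-d labels
X = [
    'It was a sunny day.',
    'The weather was nice and the birds were singing.',
    'I like trains.',
    'Hi, again.',
]
y = [0, 1, 1, 0]  # 1 = sentence belongs in the summary, 0 = it does not

clf = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
clf.fit(X, y)

# Predict the class of a new sentence
pred = clf.predict(['The birds were singing.'])
print(pred[0])
```

At summarization time you would run the classifier over every sentence of a new document and keep the ones predicted 1.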

Hope it helps!

Upvotes: 1
