ksiomelo

Reputation: 1908

Multinomial Naive Bayes with scikit-learn for continuous and categorical data

I'm new to scikit-learn, and I'm trying to create a Multinomial Naive Bayes model to predict movie box office. Below is just a toy example, and I'm not sure if it is logically correct (suggestions are welcome!). The Y's correspond to the estimated gross I'm trying to predict (1: < $20M, 2: > $20M). I also discretized the number of screens the movie was shown on.

The question is: is this a good approach to the problem? Or would it be better to assign numbers to all categories? Also, is it correct to embed the labels (e.g. "movie: Life of Pi") in the DictVectorizer object?

    import numpy as np
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def get_data():
        measurements = [
            {'movie': 'Life of Pi', 'screens': "some", 'distributor': "fox"},
            {'movie': 'The Croods', 'screens': "some", 'distributor': "fox"},
            {'movie': 'San Fransisco', 'screens': "few", 'distributor': "TriStar"},
        ]
        vec = DictVectorizer()
        arr = vec.fit_transform(measurements).toarray()
        return arr

    def predict(X):
        Y = np.array([1, 1, 2])
        clf = MultinomialNB()
        clf.fit(X, Y)
        # predict expects a 2-D array, so slice rather than index a single row
        print(clf.predict(X[2:3]))

    if __name__ == "__main__":
        vector = get_data()
        predict(vector)

Upvotes: 0

Views: 1696

Answers (1)

Andreas Mueller

Reputation: 28748

In principle this is correct, I think.

Maybe it would be more natural to formulate the problem as a regression on the box-office sales.
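A minimal sketch of that reformulation, using a linear regression on a numeric screen count (all gross figures and screen counts below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical training data: screens as a raw number,
# gross in millions of dollars as a continuous target.
measurements = [
    {'screens': 3000, 'distributor': 'fox'},
    {'screens': 2500, 'distributor': 'fox'},
    {'screens': 800, 'distributor': 'TriStar'},
]
gross = np.array([120.0, 95.0, 12.0])  # made-up targets

vec = DictVectorizer()
X = vec.fit_transform(measurements).toarray()

reg = LinearRegression()
reg.fit(X, gross)
print(reg.predict(X))  # continuous gross estimates instead of two classes
```

You can always recover the original two classes afterwards by thresholding the predicted gross at $20M.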

The movie feature is useless. The DictVectorizer encodes each possible value as a different feature. As each movie will have a different title, they will all have completely independent features, and no generalization is possible there.

It might also be better to encode screens as a number, not as a one-hot-encoding of different ranges.

Needless to say, you need much better features than what you have here to get any reasonable prediction.

Upvotes: 2
