ksiomelo

Reputation: 1908

Multinomial Naive Bayes with scikit-learn for continuous and categorical data

I'm new to scikit-learn, and I'm trying to create a Multinomial Naive Bayes model to predict movie box office. Below is just a toy example, and I'm not sure if it is logically correct (suggestions are welcome!). The Y's correspond to the estimated gross I'm trying to predict (1: < $20M, 2: > $20M). I also discretized the number of screens the movie was shown on.

The question is: is this a good approach to the problem? Or would it be better to assign numbers to all categories? Also, is it correct to embed the labels (e.g. "movie: Life of Pi") in the DictVectorizer object?

    import numpy as np
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def get_data():
        measurements = [
            {'movie': 'Life of Pi', 'screens': "some", 'distributor': "fox"},
            {'movie': 'The Croods', 'screens': "some", 'distributor': "fox"},
            {'movie': 'San Fransisco', 'screens': "few", 'distributor': "TriStar"},
        ]
        vec = DictVectorizer()
        arr = vec.fit_transform(measurements).toarray()
        return arr

    def predict(X):
        Y = np.array([1, 1, 2])
        clf = MultinomialNB()
        clf.fit(X, Y)
        # predict expects a 2-D array, so slice rather than index a single row
        print(clf.predict(X[2:3]))

    if __name__ == "__main__":
        vector = get_data()
        predict(vector)

Upvotes: 0

Views: 1696

Answers (1)

Andreas Mueller

Reputation: 28748

In principle this is correct, I think.

Maybe it would be more natural to formulate the problem as a regression on the box-office sales.
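A minimal sketch of that reformulation, using a linear regression on a numeric screen count (all gross figures and screen counts below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical training data: screens as a raw number,
# gross in millions of dollars as a continuous target.
measurements = [
    {'screens': 3000, 'distributor': 'fox'},
    {'screens': 2500, 'distributor': 'fox'},
    {'screens': 800, 'distributor': 'TriStar'},
]
gross = np.array([120.0, 95.0, 12.0])  # made-up targets

vec = DictVectorizer()
X = vec.fit_transform(measurements).toarray()

reg = LinearRegression()
reg.fit(X, gross)
print(reg.predict(X))  # continuous gross estimates instead of two classes
```

You can always recover the original two classes afterwards by thresholding the predicted gross at $20M.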

The movie feature is useless. The DictVectorizer encodes each possible value as a different feature. As each movie will have a different title, they will all have completely independent features, and no generalization is possible there.

It might also be better to encode screens as a number, not as a one-hot-encoding of different ranges.

Needless to say, you need much better features than what you have here to get any reasonable prediction.

Upvotes: 2
