Reputation: 1908
I'm new to scikit-learn, and I'm trying to build a multinomial naive Bayes model to predict movie box office. Below is just a toy example; I'm not sure if it is logically correct (suggestions are welcome!). The Y values correspond to the estimated gross I'm trying to predict (1: < $20M, 2: > $20M). I also discretized the number of screens each movie was shown on.
The question is: is this a good approach to the problem, or would it be better to assign numbers to all the categories? Also, is it correct to embed the labels (e.g. 'movie': 'Life of Pi') in the DictVectorizer object?
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

def get_data():
    measurements = [
        {'movie': 'Life of Pi', 'screens': "some", 'distributor': "fox"},
        {'movie': 'The Croods', 'screens': "some", 'distributor': "fox"},
        {'movie': 'San Fransisco', 'screens': "few", 'distributor': "TriStar"},
    ]
    vec = DictVectorizer()
    arr = vec.fit_transform(measurements).toarray()
    return arr

def predict(X):
    Y = np.array([1, 1, 2])
    clf = MultinomialNB()
    clf.fit(X, Y)
    # predict expects a 2D array, so slice a single row instead of indexing it
    print(clf.predict(X[2:3]))

if __name__ == "__main__":
    vector = get_data()
    predict(vector)
Upvotes: 0
Views: 1696
Reputation: 28748
In principle this is correct, I think.
Maybe it would be more natural to formulate the problem as a regression on the box-office sales.
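A minimal sketch of that regression framing, assuming a continuous target such as the gross in millions (the variable name gross_millions, the screen counts, and the dollar figures below are all made up for illustration):

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LinearRegression

# Hypothetical training data: screens as a number, distributor as a category,
# and the (made-up) gross in millions as a continuous target.
measurements = [
    {'screens': 3200, 'distributor': 'fox'},
    {'screens': 4000, 'distributor': 'fox'},
    {'screens': 800, 'distributor': 'TriStar'},
]
gross_millions = [125.0, 187.0, 12.0]  # made-up figures

vec = DictVectorizer()
X = vec.fit_transform(measurements).toarray()
reg = LinearRegression()
reg.fit(X, gross_millions)
# Predict the gross for a new, unseen configuration
print(reg.predict(vec.transform({'screens': 2500, 'distributor': 'fox'}).toarray()))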
The movie feature is useless. The DictVectorizer encodes each possible value as a different feature, and since every movie has a different title, they will all get completely independent features, so no generalization is possible there.
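You can see this by inspecting the learned feature names (a small sketch using the data from the question; get_feature_names_out is the current method name, older scikit-learn versions call it get_feature_names):

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'movie': 'Life of Pi', 'screens': 'some', 'distributor': 'fox'},
    {'movie': 'The Croods', 'screens': 'some', 'distributor': 'fox'},
    {'movie': 'San Fransisco', 'screens': 'few', 'distributor': 'TriStar'},
]
vec = DictVectorizer()
vec.fit_transform(measurements)
# Each title gets its own column, e.g. 'movie=Life of Pi', 'movie=The Croods', ...
# so an unseen title gives the model nothing to generalize from.
print(vec.get_feature_names_out())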
It might also be better to encode screens as a number, not as a one-hot encoding of different ranges.
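DictVectorizer passes numeric values through unchanged, so screens can stay a single numeric column (a sketch; the screen counts here are made up):

from sklearn.feature_extraction import DictVectorizer

measurements = [
    {'screens': 3600, 'distributor': 'fox'},
    {'screens': 4000, 'distributor': 'fox'},
    {'screens': 350, 'distributor': 'TriStar'},
]
vec = DictVectorizer()
# One numeric 'screens' column plus one indicator column per distributor
print(vec.fit_transform(measurements).toarray())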
Needless to say, you need much better features than what you have here to get any reasonable prediction.
Upvotes: 2