Reputation:
I'm trying to predict whether reviews on Yelp are positive or negative by performing linear regression using SGD.
I tried two different feature extractors.
The first used character n-grams and the second split the text into words on spaces.
I tried different values of n for the character n-grams and found the value that gave me the lowest test error.
I noticed that this test error (0.27 on my test data) was nearly identical to the test error I got from the space-separated word features.
Is there a reason behind this coincidence?
Shouldn't the character n-gram have a lower test error since it extracted more features than the word features?
Character n-gram: ex. n=7 "Good restaurant" => "Goodres" "oodrest" "odresta" "drestau" "restaur" "estaura" "stauran" "taurant"
Word features: "Good restaurant" => "Good" "restaurant"
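For reference, here is a minimal sketch of the two extractors as described above. It assumes spaces are removed before taking character n-grams (as the n=7 example suggests); the function names are hypothetical and this is not necessarily the exact code used in the experiment.

```python
from collections import Counter

def char_ngram_features(text, n):
    """Count character n-grams after stripping whitespace (assumed from the example)."""
    chars = "".join(text.split())
    # Slide a window of length n over the text; yields nothing if the text is shorter than n.
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def word_features(text):
    """Count whitespace-separated words."""
    return Counter(text.split())

print(list(char_ngram_features("Good restaurant", 7)))
print(list(word_features("Good restaurant")))
```

Each function returns a sparse feature map (feature to count), which could be fed to a linear model trained with SGD.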
Upvotes: 0
Views: 1357
Reputation: 189668
Looks like the n-gram method simply produced a lot of redundant, overlapping features that do not improve the predictive accuracy.
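To illustrate the overlap, here is a rough sketch using the example from the question. The helper name is hypothetical, and stripping spaces is assumed from that example. Consecutive 7-grams share 6 of their 7 characters, so each additional feature carries little new information.

```python
def char_ngrams(text, n):
    """List character n-grams after stripping whitespace (assumed from the question)."""
    chars = "".join(text.split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

grams = char_ngrams("Good restaurant", 7)

# Each n-gram's last n-1 characters are the next n-gram's first n-1 characters.
overlaps = [a[1:] == b[:-1] for a, b in zip(grams, grams[1:])]

print(len(grams), "n-grams; every neighbouring pair overlaps:", all(overlaps))
```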
Upvotes: 3