Reputation: 169
I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.
Using a bag-of-words model, I count the occurrences of words per document (the documents are posts from message boards) and create a vector for every post.
I limit the size of each vector by using the top-k (k = number) most frequently used words as features (stopwords are excluded).
Vectorexample_with_k_8 = [0,0,0,1,0,3,0,0]
My data is generally sparse, as in the example.
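For illustration, the feature construction described above can be sketched in plain Python roughly as follows. The stop-word set, the toy posts, and the helper names (`tokenize`, `build_vocabulary`, `vectorize`) are made up for this example and are not my actual code; scikit-learn's `CountVectorizer` with `max_features` should give an equivalent result.

```python
from collections import Counter

# Hypothetical stop-word list and toy posts, for illustration only.
STOPWORDS = {"the", "a", "is", "and", "to", "i"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def build_vocabulary(posts, k):
    """Return the k most frequent non-stop words across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return [word for word, _ in counts.most_common(k)]

def vectorize(post, vocabulary):
    """Count how often each vocabulary word occurs in a single post."""
    counts = Counter(tokenize(post))
    return [counts[word] for word in vocabulary]

posts = [
    "i love the new game and the new console",
    "homework is due and the exam is hard",
    "the game is hard",
]
vocabulary = build_vocabulary(posts, k=4)
vectors = [vectorize(p, vocabulary) for p in posts]
```

Each post becomes a fixed-length count vector, and most entries are zero once k is large relative to the length of a post.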
When I test the model on my test data I get a very low r² score (0.00-0.1), sometimes even a negative score. The model always predicts the same age, which happens to be the average age of my dataset, as seen in the distribution of my data (age/amount).
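To make the symptom concrete: r² is defined as 1 - SS_res / SS_tot, so a model that always outputs the training mean scores 0 on the training data and can score below 0 on test data whose mean differs. A minimal sketch, with invented ages that are not from my dataset:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical ages, for illustration only.
train_ages = [18, 22, 25, 31, 44]
test_ages = [19, 21, 24, 30]

# A "model" that ignores its input and always predicts the training mean.
train_mean = sum(train_ages) / len(train_ages)

r2_train = r2_score(train_ages, [train_mean] * len(train_ages))
r2_test = r2_score(test_ages, [train_mean] * len(test_ages))

print("r2 on train:", r2_train)
print("r2 on test:", r2_test)
```

So scores around 0 or slightly negative are what I would expect if the regressors are effectively learning nothing beyond the mean.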
I tried different regression models from scikit-learn (Linear Regression, Lasso, SGDRegressor) with no improvement.
So the questions are:
1. How do I improve the r² score?
2. Do I have to change the data to fit the regression better? If yes, with what method?
3. Which regressor/methods should I use for text classification?
Upvotes: 2
Views: 3190
Reputation: 1374
To my knowledge, bag-of-words models usually use Naive Bayes as the classifier to fit the document-by-term sparse matrix.
None of your regressors handles a large sparse matrix well. Lasso may work well if you have groups of highly correlated features.
I think Latent Semantic Analysis may give better results for your problem. Essentially, use TfidfVectorizer to normalize the word count matrix, then use TruncatedSVD to reduce the dimensionality, retaining the first N components, which capture most of the variance. Most regressors should work well with the lower-dimensional matrix. In my experience, an SVM works pretty well for this problem.
Here is an example script:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn import svm
from sklearn.pipeline import Pipeline
# GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in 0.20)
from sklearn.model_selection import GridSearchCV
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svd', TruncatedSVD()),
    ('clf', svm.SVR())
])
# You can tune hyperparameters using grid search
params = {
    'tfidf__max_df': (0.5, 0.75, 1.0),
    'tfidf__ngram_range': ((1, 1), (1, 2)),
    'svd__n_components': (50, 100, 150, 200),
    'clf__C': (0.1, 1, 10),
}
grid_search = GridSearchCV(pipeline, params, scoring='r2',
                           n_jobs=-1, verbose=10)
# Fit on your documents (should be a list/array of strings); y holds the target ages
grid_search.fit(documents, y)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
# Iterate over the tuned parameter names (the grid dict above is called `params`)
for param_name in sorted(params.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Upvotes: 7