Working with and preparing bag-of-words data for regression

Question

I'm trying to create a regression model that predicts an author's age. I'm using (Nguyen et al., 2011) as my basis.

Using a bag-of-words model, I count the occurrences of words per document (the documents are posts from message boards) and create a vector for every post.

I limit the size of each vector by using only the k most frequently used words as features, for some fixed number k (stopwords are excluded).

Example vector with k = 8: [0, 0, 0, 1, 0, 3, 0, 0]

My data is generally sparse, as in the example.
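For reference, the feature construction described above can be sketched in plain Python. This is only an illustration under assumptions: the posts, the stopword list, the whitespace tokenizer, and the helper names (`tokenize`, `top_k_vocabulary`, `vectorize`) are all made up here and are not taken from the actual pipeline or from Nguyen et al.

```python
from collections import Counter

# Hypothetical posts and stopword list, made up for illustration only.
posts = [
    "the game was great and the team played great",
    "my exam is tomorrow and the exam is hard",
    "the team lost the game",
]
stopwords = {"the", "was", "and", "my", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


def top_k_vocabulary(docs, k):
    """Return the k most frequent non-stopword tokens over all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    # Sort by descending frequency; break ties alphabetically so the
    # feature order is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:k]]


def vectorize(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


k = 4
vocabulary = top_k_vocabulary(posts, k)
vectors = [vectorize(post, vocabulary) for post in posts]

print(vocabulary)  # ['exam', 'game', 'great', 'team']
for vec in vectors:
    print(vec)
```

Even in this tiny example most entries are zero, which matches the sparsity described above. In scikit-learn, `CountVectorizer` with its `max_features` and `stop_words` parameters covers the same ground and returns a sparse matrix.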

When I evaluate the model on my test data, I get a very low r² score (0.00 to 0.1), sometimes even a negative score. The model always predicts the same age, which happens to be the average age of my dataset, as can be seen in the distribution of my data (age/amount).

I used different regression models from scikit-learn (linear regression, Lasso, and SGDRegressor), with no improvement.
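The r² values reported above are what a constant prediction would be expected to produce. The sketch below uses the standard definition r² = 1 - SS_res / SS_tot with hypothetical ages (not the real dataset). It shows that predicting the training mean for every post gives exactly 0 on the training data and a slightly negative score on test data whose mean differs.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical ages, made up for illustration only.
train_ages = [20, 22, 24, 26, 28]
test_ages = [18, 21, 30]

# A model that ignores its features and always outputs the training mean.
constant = sum(train_ages) / len(train_ages)

r2_train = r2_score(train_ages, [constant] * len(train_ages))
r2_test = r2_score(test_ages, [constant] * len(test_ages))

print(r2_train)  # 0.0
print(r2_test)   # slightly negative: the test mean differs from the training mean
```

A score near zero (or just below it) is therefore consistent with the features carrying almost no usable signal for the fitted model, rather than with a bug in the scoring itself.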

So the questions are:

1. How do I improve the r² score?

2. Do I have to change the data to fit the regression better? If yes, with what method?

3. Which regressors/methods should I use for text classification?

