Matthijs

Reputation: 1

Regression with vector as independent variable instead of multiple single value variables?

I am doing a project where I am trying to predict user scores (reviews) of books from the sentiment of the sentences in each book.

A graph to give you an idea:

Red is the average sentiment plot of the highest scoring 25% of books, blue of the lowest scoring 25%.

As you can see, the books start out fairly neutral, dip before the end, and all have high sentiment at the very end. As you can also see, the best scoring 25% (red) reaches the highest peak at the end.

What I want to do is use regression to predict the score of a book based on a vector containing the sentiment scores of every sentence in the book.

I have tried a few things, but nothing works.

My idea was to split all books into 100 parts, take the average of these 100 parts for every book, and train a Support Vector Regression model (with poly kernel) on this data. However, it does not perform better than just predicting the mean score every time.

So:

1 independent variable = [avg_sentiment1, avg_sentiment2, ..., avg_sentiment100]
1 dependent variable = score (a number between 1 and 5, or more specifically, in our dataset, between ~3.200 and ~4.700)

So while fitting the sklearn SVR model using this setup does not give any errors, it just does not seem to learn (it performs worse than a dummy predictor that always predicts the mean). I have tried a few different regression models (Ridge, SVR with different kernels, adding a SplineTransformer) with different parameters. None do better than the mean baseline.
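To make this concrete, my setup looks roughly like this (a minimal sketch with made-up placeholder data; in my real code X holds the 100 averaged sentiments per book and y holds the scores):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data just for the sketch: 500 books x 100 averaged sentiment bins,
# scores in the ~3.2 to ~4.7 range we see in our dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
y = rng.uniform(3.2, 4.7, size=500)

svr = SVR(kernel="poly", degree=3)
dummy = DummyRegressor(strategy="mean")

print("SVR   R^2:", cross_val_score(svr, X, y, cv=5, scoring="r2").mean())
print("Dummy R^2:", cross_val_score(dummy, X, y, cv=5, scoring="r2").mean())
```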

All the examples of regression I can find online seem to use either a single value predicting a single value (independent variable age = 12, dependent variable height = 160 cm, that kind of thing), or at most a handful of variables (adding more single values, like weight = 67 kg, as extra independent variables).

Is my regression interpreting my 100 number vector as 100 unrelated variables? Does that matter?

Help, I am out of my depth here. What kind of technique would be most applicable? Preferably something I can find in SKLearn; I am not an expert (as you can probably tell).

Upvotes: 0

Views: 144

Answers (1)

MuhammedYunus

Reputation: 5010

It sounds like you've started by condensing the input size from 2000 down to 100, i.e. dimensionality reduction. Usually this affords benefits in terms of speed and memory, and might also improve performance.

However, before trying dimensionality reduction, I think it's worth seeing how the original features perform. You can feed your data straight into an ExtraTreesRegressor, RandomForestRegressor, or HistGradientBoostingRegressor / GradientBoostingRegressor and see how it does. No need to preprocess features for the aforementioned regressors.
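For example, a rough sketch (the data here is a random stand-in; swap in your full-length sentiment matrix and scores):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Stand-in data: replace with your raw per-sentence sentiment matrix and scores.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 2000))   # 500 books x ~2000 sentence sentiments
y = rng.uniform(3.2, 4.7, size=500)    # review scores

for model in (ExtraTreesRegressor(n_estimators=300, random_state=0),
              HistGradientBoostingRegressor(random_state=0)):
    r2 = cross_val_score(model, X_raw, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, "R^2:", r2)
```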

My idea was to split all books into 100 parts, take the average of these 100 parts for every book, and train a Support Vector Regression model (with poly kernel) on this data.

Looking at the graphs you've shown, I think this approach to dimensionality reduction is likely diluting some of the key discriminative features of the data. Comparing the red and blue classes, what they have in common is that they're somewhat periodic (cycling every ~250 steps); they trend in the same direction; and they swing up at the end (to different degrees). The blue graph is offset by about half a cycle.

When downsampling this data to 100 points, the periodic nature will be attenuated, and you'd be left with only the general trend, which is similar for both red and blue. Also, the upswing lasts only for a relatively short period, so averaging it could make the red and blue tails less distinct. In short, the new features seem likely to draw the classes closer together in feature space, rather than helping to separate them.

Some alternative ways of doing dimensionality reduction with sklearn (sketched briefly after the list):

  • PLSRegression reduces the dimensionality of your data whilst taking the target into account.
  • PCA reduces the dimensionality of the data by keeping the directions of highest variance and discarding the rest. However, it doesn't look at the target, so it can throw away information that is actually predictive.
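A hedged sketch of both options (placeholder data again; the component counts are just starting guesses to tune):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Stand-in data: replace with your sentiment matrix and scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))
y = rng.uniform(3.2, 4.7, size=500)

# PLS: supervised reduction and linear regression in one estimator.
pls = PLSRegression(n_components=10)
print("PLS R^2:", cross_val_score(pls, X, y, cv=5, scoring="r2").mean())

# PCA: unsupervised reduction, then a regressor of your choice on the components.
pca_ridge = make_pipeline(PCA(n_components=10), Ridge())
print("PCA+Ridge R^2:", cross_val_score(pca_ridge, X, y, cv=5, scoring="r2").mean())
```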

Is my regression interpreting my 100 number vector as 100 unrelated variables? Does that matter?

The model treats the 100 numbers as 100 separate input features, but they work collectively to define the fitted regression function. The model doesn't take their ordering into account though, meaning that you can train it on data whose columns have been shuffled (the same shuffle applied to every book) and it will give you the same results.
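As a quick illustration (a sketch with random data; an RBF-kernel SVR only sees pairwise distances between samples, so a consistent column shuffle leaves the fit unchanged):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = rng.uniform(3.2, 4.7, size=200)

# Apply the same column permutation to every sample.
perm = rng.permutation(X.shape[1])
X_shuffled = X[:, perm]

pred_original = SVR(kernel="rbf").fit(X, y).predict(X)
pred_shuffled = SVR(kernel="rbf").fit(X_shuffled, y).predict(X_shuffled)
print(np.allclose(pred_original, pred_shuffled))  # True: column order is ignored
```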

Processing sequences

The data you have is sequential in nature, but the methods above ignore this. If the sequential characteristic is important, then it might be worth considering PyTorch or similar for training a sequence-based model. In your case, you'd set up a sequence-to-vector architecture, where the input is the sequence of sentiments and the output is a single score for the entire sequence.

There are different ways of processing sequences using neural nets. Given how long your sequences are (2000 in their original form), the first stage of the model could be a convolutional block that condenses the 2000 samples down to a smaller size. This intermediate result could either feed into another convolutional block that does the rest, or alternatively into a recurrent model. At the end, there will be a final layer that maps the condensed representation to a single predicted score.
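A very rough PyTorch sketch of such a sequence-to-vector model (all layer sizes and kernel widths are placeholder guesses, not tuned values):

```python
import torch
import torch.nn as nn

class SentimentToScore(nn.Module):
    """Conv block downsamples the ~2000-step sentiment sequence,
    a GRU summarises it, and a linear head predicts the score."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):                  # x: (batch, seq_len) sentiment values
        x = x.unsqueeze(1)                 # -> (batch, 1, seq_len)
        x = self.conv(x)                   # -> (batch, 32, seq_len / 16)
        x = x.transpose(1, 2)              # -> (batch, steps, 32) for the GRU
        _, h = self.gru(x)                 # h: (1, batch, 64) final hidden state
        return self.head(h.squeeze(0)).squeeze(-1)  # -> (batch,) predicted scores

model = SentimentToScore()
dummy_batch = torch.randn(8, 2000)         # 8 books, 2000 sentence sentiments each
print(model(dummy_batch).shape)            # torch.Size([8])
```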

Upvotes: 0
