Matthijs

Reputation: 1

Regression with vector as independent variable instead of multiple single value variables?

I am doing a project where I am trying to predict user scores (reviews) of books from the sentiment of the sentences in each book.

A graph to give you an idea:

Red is the average sentiment plot of the highest scoring 25% of books, blue of the lowest scoring 25%.

As you can see, the books start out fairly neutral, dip before the end, and all have high sentiment at the very end. As you can also see, the best scoring 25% (red) reaches the highest peak at the end.

What I want to do is use regression to predict the score of a book based on a vector containing the sentiment scores of every sentence in the book.

I have tried a few things, but nothing works.

My idea was to split all books into 100 parts, take the average of these 100 parts for every book, and train a Support Vector Regression model (with poly kernel) on this data. However, it does not perform better than just predicting the mean score every time.

So:

1 independent variable = [avg_sentiment1, avg_sentiment2, ..., avg_sentiment100]
1 dependent variable = score (a number between 1 and 5, or more specifically, in our dataset, between ~3.200 and ~4.700)

So while fitting the sklearn SVR model using this setup does not give any errors, it just does not seem to learn (it performs worse than a dummy predictor that always predicts the mean). I have tried a few different regression models (Ridge, SVR with different kernels, adding a SplineTransformer) with different parameters. None do better than the mean baseline.
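To make this concrete, my setup looks roughly like this (a minimal sketch with made-up placeholder data; in my real code X holds the 100 averaged sentiments per book and y holds the scores):

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Placeholder data just for the sketch: 500 books x 100 averaged sentiment bins,
# scores in the ~3.2 to ~4.7 range we see in our dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))
y = rng.uniform(3.2, 4.7, size=500)

svr = SVR(kernel="poly", degree=3)
dummy = DummyRegressor(strategy="mean")

print("SVR   R^2:", cross_val_score(svr, X, y, cv=5, scoring="r2").mean())
print("Dummy R^2:", cross_val_score(dummy, X, y, cv=5, scoring="r2").mean())
```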

All the examples of regression I can find online seem to use either a single value predicting a single value (independent variable age = 12, dependent variable height = 160 cm, that kind of thing), or at most a handful of variables (adding more single values, like weight = 67 kg, as extra independent variables).

Is my regression interpreting my 100 number vector as 100 unrelated variables? Does that matter?

Help, I am out of my depth here. What kind of technique would be most applicable? Preferably something I can find in SKLearn; I am not an expert (as you can probably tell).

Upvotes: 0

Views: 144

Answers (1)

MuhammedYunus

Reputation: 5010

It sounds like you've started by condensing the input size from 2000 down to 100, i.e. dimensionality reduction. Usually this affords benefits in terms of speed and memory, and might also improve performance.

However, before trying dimensionality reduction, I think it's worth seeing how the original features perform. You can feed your data straight into an ExtraTreesRegressor, RandomForestRegressor, or HistGradientBoostingRegressor / GradientBoostingRegressor and see how it does. No need to preprocess features for the aforementioned regressors.
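For example, a rough sketch (the data here is a random stand-in; swap in your full-length sentiment matrix and scores):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Stand-in data: replace with your raw per-sentence sentiment matrix and scores.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(500, 2000))   # 500 books x ~2000 sentence sentiments
y = rng.uniform(3.2, 4.7, size=500)    # review scores

for model in (ExtraTreesRegressor(n_estimators=300, random_state=0),
              HistGradientBoostingRegressor(random_state=0)):
    r2 = cross_val_score(model, X_raw, y, cv=5, scoring="r2").mean()
    print(type(model).__name__, "R^2:", r2)
```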

My idea was to split all books into 100 parts, take the average of these 100 parts for every book, and train a Support Vector Regression model (with poly kernel) on this data.

Looking at the graphs you've shown, I think this approach to dimensionality reduction is likely diluting some of the key discriminative features of the data. Comparing the red and blue classes, what they have in common is that they're somewhat periodic (cycling every ~250 steps); they trend in the same direction; and they swing up at the end (to different degrees). The blue graph is offset by about half a cycle.

When downsampling this data to 100 points, the periodic nature will be attenuated, and you'd be left with only the general trend, which is similar for both red and blue. Also, the upswing lasts only for a relatively short period, so averaging it could make the red and blue tails less distinct. In short, the new features seem likely to draw the classes closer together in feature space, rather than helping to separate them.

Some alternative ways of doing dimensionality reduction with sklearn (sketched briefly after the list):

  • PLSRegression reduces the dimensionality of your data whilst taking the target into account.
  • PCA reduces the dimensionality of the data by keeping the directions of highest variance and discarding the rest. However, it doesn't look at the target, so it can throw away information that is actually predictive.
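A hedged sketch of both options (placeholder data again; the component counts are just starting guesses to tune):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Stand-in data: replace with your sentiment matrix and scores.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))
y = rng.uniform(3.2, 4.7, size=500)

# PLS: supervised reduction and linear regression in one estimator.
pls = PLSRegression(n_components=10)
print("PLS R^2:", cross_val_score(pls, X, y, cv=5, scoring="r2").mean())

# PCA: unsupervised reduction, then a regressor of your choice on the components.
pca_ridge = make_pipeline(PCA(n_components=10), Ridge())
print("PCA+Ridge R^2:", cross_val_score(pca_ridge, X, y, cv=5, scoring="r2").mean())
```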

Is my regression interpreting my 100 number vector as 100 unrelated variables? Does that matter?

The model treats the 100 numbers as 100 separate input features, but they work collectively to define the fitted regression function. The model doesn't take their ordering into account though, meaning that you can train it on data whose columns have been shuffled (the same shuffle applied to every book) and it will give you the same results.
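As a quick illustration (a sketch with random data; an RBF-kernel SVR only sees pairwise distances between samples, so a consistent column shuffle leaves the fit unchanged):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = rng.uniform(3.2, 4.7, size=200)

# Apply the same column permutation to every sample.
perm = rng.permutation(X.shape[1])
X_shuffled = X[:, perm]

pred_original = SVR(kernel="rbf").fit(X, y).predict(X)
pred_shuffled = SVR(kernel="rbf").fit(X_shuffled, y).predict(X_shuffled)
print(np.allclose(pred_original, pred_shuffled))  # True: column order is ignored
```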

Processing sequences

The data you have is sequential in nature, but the methods above ignore this. If the sequential characteristic is important, then it might be worth considering PyTorch or similar for training a sequence-based model. In your case, you'd set up a sequence-to-vector architecture, where the input is the sequence of sentiments and the output is a single score for the entire sequence.

There are different ways of processing sequences using neural nets. Given how long your sequences are (2000 in their original form), the first stage of the model could be a convolutional block that condenses the 2000 samples down to a smaller size. This intermediate result could either feed into another convolutional block that does the rest, or alternatively into a recurrent model. At the end, there will be a final layer that maps the condensed representation to a single predicted score.
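A very rough PyTorch sketch of such a sequence-to-vector model (all layer sizes and kernel widths are placeholder guesses, not tuned values):

```python
import torch
import torch.nn as nn

class SentimentToScore(nn.Module):
    """Conv block downsamples the ~2000-step sentiment sequence,
    a GRU summarises it, and a linear head predicts the score."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=4, padding=4), nn.ReLU(),
        )
        self.gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, 1)

    def forward(self, x):                  # x: (batch, seq_len) sentiment values
        x = x.unsqueeze(1)                 # -> (batch, 1, seq_len)
        x = self.conv(x)                   # -> (batch, 32, seq_len / 16)
        x = x.transpose(1, 2)              # -> (batch, steps, 32) for the GRU
        _, h = self.gru(x)                 # h: (1, batch, 64) final hidden state
        return self.head(h.squeeze(0)).squeeze(-1)  # -> (batch,) predicted scores

model = SentimentToScore()
dummy_batch = torch.randn(8, 2000)         # 8 books, 2000 sentence sentiments each
print(model(dummy_batch).shape)            # torch.Size([8])
```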

Upvotes: 0
