Reputation: 4608
I have a set of numerical features (f1, f2, f3, f4, f5)
as follows for each user in my dataset.
       f1   f2   f3   f4   f5
user1  0.1  1.1  0    1.7  1
user2  1.1  0.3  1    1.3  3
user3  0.8  0.3  0    1.1  2
user4  1.5  1.2  1    0.8  3
user5  1.6  1.3  3    0.3  0
My target output is a prioritised user list, as shown in the example below.
       f1   f2   f3   f4   f5   target_priority
user1  0.1  1.1  0    1.7  1    2
user2  1.1  0.3  1    1.3  3    1
user3  0.8  0.3  0    1.1  2    5
user4  1.5  1.2  1    0.8  3    3
user5  1.6  1.3  3    0.3  0    4
I want to use these features in a way that reflects the priority of each user.
Currently, I am multiplying all the features of each user to get a score, then ranking the users based on that score (example shown below).
       f1   f2   f3   f4   f5   multiplied_score  predicted_priority
user1  0.1  1.1  0    1.7  1    0                 5
user2  1.1  0.3  1    1.3  3    1.287             2
user3  0.8  0.3  1    1.1  2    0.528             4
user4  1.5  1.2  1    0.8  3    4.32              1
user5  1.6  1.3  1    0.3  1    0.624             3
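For reference, this multiply-and-rank baseline can be reproduced in a few lines of pandas (a sketch built from the table above; the DataFrame construction is assumed, the data could equally come from a CSV file):

```python
import pandas as pd

# Feature table from the example above
df = pd.DataFrame(
    {"f1": [0.1, 1.1, 0.8, 1.5, 1.6],
     "f2": [1.1, 0.3, 0.3, 1.2, 1.3],
     "f3": [0, 1, 1, 1, 1],
     "f4": [1.7, 1.3, 1.1, 0.8, 0.3],
     "f5": [1, 3, 2, 3, 1]},
    index=["user1", "user2", "user3", "user4", "user5"])

# Multiply all features row-wise to get a score ...
df["multiplied_score"] = df.prod(axis=1)
# ... and rank users by that score (highest score = priority 1)
df["predicted_priority"] = df["multiplied_score"].rank(ascending=False).astype(int)
print(df[["multiplied_score", "predicted_priority"]])
```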
However, merely multiplying the features and ranking based on the multiplied score did not perform well. I think the features should be upweighted or downweighted based on their contribution to correctly predicting the priority.
Therefore, I would like to know if there is a way (in machine learning/data science/statistics) to learn an optimal ranking function from my features, so that the resulting ranked list is as close as possible to the real ranking.
I am happy to provide more details if needed.
Upvotes: 1
Views: 156
Reputation: 88236
One way to tackle this problem is to use a machine learning algorithm that learns the underlying function in order to predict the most likely score of a new user from its features.
Note, however, that the model will not perform well unless the number of samples is high enough. Five samples are obviously not enough; this is just a sketch to give you an idea of how you could approach the problem with machine learning.
I will be using a RandomForestRegressor as an example:
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
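The code below assumes the question's data (including the target_priority column) has already been loaded into a DataFrame df, for instance like this (a sketch; the data could equally be read from a file):

```python
import pandas as pd

# Recreate the question's data; how `df` is actually built is an assumption
df = pd.DataFrame(
    {"f1": [0.1, 1.1, 0.8, 1.5, 1.6],
     "f2": [1.1, 0.3, 0.3, 1.2, 1.3],
     "f3": [0, 1, 0, 1, 3],
     "f4": [1.7, 1.3, 1.1, 0.8, 0.3],
     "f5": [1, 3, 2, 3, 0],
     "target_priority": [2, 1, 5, 3, 4]},
    index=["user1", "user2", "user3", "user4", "user5"])
```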
Let's start by defining the features and target that will be fed to the model:
X_ = df.drop(['target_priority'], axis=1).values
scaler = MinMaxScaler()
X = scaler.fit_transform(X_)
y = df.target_priority
Now let's fit the model:
rf = RandomForestRegressor()
rf.fit(X,y)
Here I have not split the data into train and test sets, but you should do that in order to get an idea of how well your model performs. Given that there is only a single sample for each existing target, I've trained the model on all samples, and will create a test set by adding some noise to the training data:
import numpy as np

noise = np.random.normal(loc=0, scale=0.2, size=X.shape)
X_test = X + noise
And now you can obtain predictions on the test set using the trained model:
y_pred = rf.predict(X_test).round()
# array([2., 2., 4., 3., 4.])
As you can see, even with the small number of samples used to train the model, it was able to predict with a mean absolute error of only 0.4:
np.abs(y - y_pred).mean()
# 0.4
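Finally, since the question suggests that features should be up- or down-weighted, note that a fitted random forest also exposes feature_importances_, which estimate how much each feature contributed to the predictions. Here is a self-contained sketch on the same toy data (with only five samples the importances are not reliable; this just shows the API):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same toy data as in the question (unscaled; scaling is not required for trees)
X = np.array([[0.1, 1.1, 0, 1.7, 1],
              [1.1, 0.3, 1, 1.3, 3],
              [0.8, 0.3, 0, 1.1, 2],
              [1.5, 1.2, 1, 0.8, 3],
              [1.6, 1.3, 3, 0.3, 0]])
y = [2, 1, 5, 3, 4]

rf = RandomForestRegressor(random_state=0).fit(X, y)

# Importances sum to 1; larger values suggest a feature carries more weight
for name, imp in zip(["f1", "f2", "f3", "f4", "f5"], rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```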
Upvotes: 2