Rami

Reputation: 8314

Ground-truth and feature extraction for predictive modelling

I have a dataset of users; each user has daily information about their activities (numerical values representing some measurements of their physical activity).

In addition, each user has, for each day, a boolean value that indicates whether he/she took a particular action.

The dataset looks as follows:

+------+----------+----------+----------+-------+
|userID|      date| activity1| activity2| action|
+------+----------+----------+----------+-------+
| user1|2016-06-05|       5.3|         6|  false|
| user1|2016-06-04|       3.1|         8|   true|
| user1|2016-06-03|       2.0|        13|  false|
| user1|2016-06-02|       4.7|         1|  false|
| user1|2016-06-01|       1.3|         9|  false|
| user1|   ...etc.|       ...|       ...|    ...|
| user2|2016-06-05|       0.6|         5|   true|
| user2|2016-06-04|       3.0|         5|  false|
| user2|2016-06-03|       0.0|         0|  false|
| user2|2016-06-02|       2.1|         3|  false|
| user2|2016-06-01|       6.3|         9|  false|
| user2|   ...etc.|       ...|       ...|    ...|
| user3|2016-06-05|       5.3|         0|  false|
| user3|2016-06-04|       5.3|        11|  false|
| user3|2016-06-03|       6.8|         5|  false|
| user3|2016-06-02|       4.9|         2|  false|
| user3|   ...etc.|       ...|       ...|    ...|
+------+----------+----------+----------+-------+

Note that the dataset is not fixed: a new row is added for each user every day, but the number of columns is fixed.

Goal

Build a model that predicts which user is likely to take the action in the near future (e.g. in any of the next 7 days).

Approach

My approach is to build feature vectors representing the activity values for each user over a period of time, and to use the action column as the source of ground truth. I then feed the ground truth and the feature vectors to a binary classification algorithm (e.g. SVM or Random Forest) in order to train a model that can predict whether a user is likely to take the action.

Problem

I started with the positive examples, i.e. the users who took the action. To extract the feature vector of a positive example, I combined the activity values over the X (30, 7 or 1) days preceding the action (the day the action was taken is included).
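Concretely, here is roughly what I do for the positives, as a simplified sketch (pandas and these exact column names are just for illustration):

    import pandas as pd

    # For each (user, day) where action is True, keep the X days of activities
    # ending on (and including) the action day.
    def positive_windows(df, X=7):
        df = df.assign(date=pd.to_datetime(df["date"])).sort_values(["userID", "date"])
        windows = []
        for user, action_date in df.loc[df["action"], ["userID", "date"]].itertuples(index=False):
            mask = ((df["userID"] == user)
                    & (df["date"] > action_date - pd.Timedelta(days=X))
                    & (df["date"] <= action_date))   # action day included
            windows.append(df[mask])
        return windows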

When I moved to the negative examples, it got less obvious: I am not sure how to select negative examples or how to extract features from them. This actually led me to question whether my way of selecting positive examples and building the feature vectors was correct in the first place.

Questions

  1. How to build the ground-truth of positive (users who did take the action) and negative (users who didn't take the action) examples?
  2. What is a negative example in this case? Is it a user who didn't take the action during a fixed period of time? What if they didn't take the action in this fixed period, but took it right after?
  3. What are the possible approaches for selecting the ranges of dates from which to extract feature vectors?

Broader Question

Are there more suitable approaches (other than classification) to solve this kind of problem?

Upvotes: 2

Views: 229

Answers (1)

Horia Coman

Reputation: 8781

You're off to a good start with the representation you have. If you look at the last X days of activity for a user before they take an action, you have M time series, one for each activity. In your example M = 2, but in practice, from what I gather, you'd have many more. You can then concatenate the M time series to obtain an M*X-dimensional feature vector.

For your example, if we take M = 2 and X = 5, then for user1, over the five days ending 2016-06-05, we'd have one time series for activity1, [1.3 4.7 2.0 3.1 5.3], and one for activity2, [9 1 13 8 6], which you can concatenate to obtain the feature vector [1.3 4.7 2.0 3.1 5.3 9 1 13 8 6] with label action=false.
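For instance, a minimal sketch of that construction (pandas/NumPy, with the column names assumed from your table; this is illustrative, not the only way to do it):

    import numpy as np
    import pandas as pd

    # Build one M*X-dimensional example for a given user and end date; the label is
    # whether the action happened on the last day of the window.
    def make_example(df, user, end_date, X=5, activities=("activity1", "activity2")):
        df = df.assign(date=pd.to_datetime(df["date"]))
        end_date = pd.to_datetime(end_date)
        window = df[(df["userID"] == user)
                    & (df["date"] > end_date - pd.Timedelta(days=X))
                    & (df["date"] <= end_date)].sort_values("date")
        x = np.concatenate([window[a].to_numpy() for a in activities])  # M series, concatenated
        y = bool(window.iloc[-1]["action"])
        return x, y

With your numbers, make_example(df, "user1", "2016-06-05") gives exactly [1.3 4.7 2.0 3.1 5.3 9 1 13 8 6] with label False.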

Build loads of these and feed them to a binary classifier and you've got the basis for something neat.

Things depend a little on what the action is, and how rarely it occurs:

- If the action is big, non-reversible and rare, such as "has signed up for our premium product" or "has had a heart attack", then you are safe looking at the data as described above.
- If the action occurs more often, and can occur multiple times per user, such as "has shared their running status on Facebook from our app today", then you need to filter negatives more aggressively, and perhaps only look at a smaller window, or only at users who never take the action, etc.

In general, I would just try a simple thing and see what performance I obtain on an independent test set. If it's good, perhaps there's no need for further engineering. If it's bad, you start tweaking things in your ML pipeline, starting with feature extraction and going down to the parameters of your model or training algorithm.
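A bare-bones version of that (scikit-learn, with examples being whatever list of (feature_vector, label) pairs you built above; just a sketch, not a prescription):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    # Train a simple classifier and report performance on a held-out test set.
    def evaluate_baseline(examples):
        X_feat = np.vstack([x for x, _ in examples])
        y = np.array([label for _, label in examples])
        X_train, X_test, y_train, y_test = train_test_split(
            X_feat, y, test_size=0.3, stratify=y, random_state=0)
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        clf.fit(X_train, y_train)
        print(classification_report(y_test, clf.predict(X_test)))
        return clf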

As another modelling choice: if each activity produces a relatively continuous signal over those X days, rather than a spiky one with many days of inactivity followed by one day of activity, I would go the route of using a neural network, or at least an SVM with signal-aware kernels, especially once you have a beefier feature extraction setup. Random Forests are not going to be so great for signals in this case.

You might also pose the problem as one of anomaly detection, especially if it's very hard to build one class (negatives or positives) but not the other. In this setup you basically model the distribution of one class, and then consider anything which has a low probability under that distribution to be an anomaly or outlier. The Coursera ML course is a good starting point for anomaly detection. I believe they just build a multivariate Gaussian, which is definitely something that can be improved upon. Your kNN suggestion from the comments is also good, though it's going to be more computationally expensive. The problem is basically one of density estimation in a first form, so anything from that toolset works (parametric methods like mixtures of Gaussians, random fields, etc., or non-parametric methods like kNN, Gaussian processes, etc.).
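A sketch of the multivariate-Gaussian flavour of this (SciPy; the threshold is something you'd tune on a validation set, and modelling the negatives here is just one possible choice):

    import numpy as np
    from scipy.stats import multivariate_normal

    # Fit a multivariate Gaussian to one class (say the negatives) and flag anything
    # with low density under it as an anomaly / likely member of the other class.
    def fit_gaussian(X_normal):
        mu = X_normal.mean(axis=0)
        cov = np.cov(X_normal, rowvar=False) + 1e-6 * np.eye(X_normal.shape[1])  # regularize
        return multivariate_normal(mean=mu, cov=cov)

    def flag_anomalies(model, X, log_threshold):
        return model.logpdf(X) < log_threshold   # log-density avoids underflow in high dimensions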

For your question 2, don't worry that much about what is positive and what is negative. You're dealing with imperfect information; whatever system you have is going to have false positives and false negatives. You could have a user who for 10 years doesn't take the action, but then on the 3651st day they do. Does that mean the previous 10 years' worth of data is invalid? Not really: they are still good examples of what a user who doesn't sign up looks like. You just have to take care not to end up with too bad a negative setup, e.g. one where more than half of the X days are positive days but the whole series ends in a negative; that's another meta-parameter you can tweak in order to get good results.

Similarly for question 3, X is a meta-parameter: it controls the whole process, rather than just one model or another. One approach to selecting it is going by gut feeling or "domain knowledge": X=1 is too small, X=365 is too big, but X=14 or X=30 seems reasonable. If the number of meta-parameters and their domains aren't that large, you could even do a grid search: try every combination and choose the one which gives the pipeline with the best results. The problem itself is one of combinatorial optimization, and grid search is a very basic algorithm for solving it, so you can go wild with this sub-problem as well.
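A sketch of what that grid search over X could look like (build_dataset is assumed to be your own feature-extraction step, e.g. the window/concatenation code above, returning a feature matrix and labels for a given X):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Try a few window sizes, score each resulting pipeline with cross-validation,
    # and keep the best one.
    def pick_window_size(df, build_dataset, candidates=(7, 14, 30)):
        scores = {}
        for X in candidates:
            feats, labels = build_dataset(df, X)
            clf = RandomForestClassifier(n_estimators=200, random_state=0)
            scores[X] = cross_val_score(clf, feats, labels, cv=5, scoring="roc_auc").mean()
        return max(scores, key=scores.get), scores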

Definitely check out the chapters on proper algorithm performance evaluation and the bias-variance trade-off in the Coursera course mentioned above, since, with limited data, you might be backing yourself into a pipeline that is too specialised for the training data and does not generalize well.

Upvotes: 1
