Agate

Reputation: 3232

Predict a user input using current datetime and past history

The project

I'm working on a small project where users can create events (such as Eat, Sleep, Watch a movie, etc.) and record log entries matching these events.

My data model looks like this (the application itself is in Python 3 / Django, but I don't think it matters here):

# a sample event
event = {
    'id': 1,
    'name': 'Eat',
}

# a sample entry
entry = {
    'event_id': event['id'],
    'user_id': 12,

    # record date of the entry
    'date': '2017-03-16T12:56:32.095465+00:00',

    # user-given tags for the entry
    'tags': ['home', 'delivery'],

    # A user notation for the entry, it can be positive or negative
    'score': 2,

    'comment': 'That was a tasty meal',

}

Users can record an arbitrary number of entries for an arbitrary number of events, and they can create new events whenever they need to. The data is stored in a relational database.

Now, I'd like to make data entry easier for users by suggesting relevant events when they visit the "Add an entry" form. At the moment, they can select the event corresponding to their entry in a dropdown, but I want to suggest a few relevant events on top of that.

I was thinking that, given a user history (all the recorded entries), it should be possible to predict likely inputs, by identifying patterns in entries, e.g:

Ideally, I'd like a function that, given a user ID and a datetime, uses the user history to return a list of the events most likely to occur:

def get_events(user_id, datetime, max=3):
    # implementation

    # returns a list of up to max events
    return events

So if I take the previous example (with more human dates), I would get the following results:

>>> get_events(user_id, 'Friday at 9:00 PM')
['Watch a movie', 'Sleep', 'Eat']

>>> get_events(user_id, 'Friday at 9:00 PM', max=2)
['Watch a movie', 'Sleep']

>>> get_events(user_id, 'Monday at 9:00 PM')
['Sleep', 'Eat', 'Watch a movie']

>>> get_events(user_id, 'Monday at noon')
['Eat']

Of course, in real life, I'll pass real datetimes, and I'd like to get event IDs so I can fetch the corresponding data from the database.

My question

(Sorry, if it took some time to explain the whole thing)

My actual question is: what algorithms / tools / libraries are required to achieve this? Is it even possible?

My current guess is that I'll need to use some fancy machine learning stuff, using something like scikit-learn and classifiers, feeding it with the user history to train it, then ask the whole thing to do its magic.

I'm not at all familiar with machine learning, and I fear I don't have enough mathematical/scientific background to get started on my own. Can you point me to some reference material to help me understand how to solve this, the algorithms / vocabulary I have to dig into, or some pseudocode?

Upvotes: 3

Views: 841

Answers (1)

PidgeyUsedGust

Reputation: 817

I think a k-nearest neighbours (kNN) approach would be a good starting point. The idea in this specific case is to look for the k entries recorded closest to the given time of day and count which events occur most often among them.

Example

Say you have as input Friday at 9:00 PM. Compute the distance from every entry in the database to this date and rank the entries in ascending order. For example, if we take the distance in minutes, a ranking could look as follows.

('Eat', 34)
('Sleep', 54)
('Eat', 76)
   ...
('Watch a movie', 93)

Next you take the first k = 3 of those and compute how often they occur,

('Eat', 2)
('Sleep', 1)

so that the function returns ['Eat', 'Sleep'] (in that order).
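The counting step from this example can be sketched in a few lines of Python. This is only an illustration: the function name `top_events` and its `k` and `limit` parameters are made up here, and the ranking is the sample one from above with the elided middle part left out.

```python
from collections import Counter

def top_events(ranking, k=3, limit=3):
    """Majority vote over the k nearest entries.

    ranking: list of (event_name, distance) tuples, sorted by
             ascending distance (hypothetical input format).
    """
    nearest = ranking[:k]                        # keep the k closest entries
    votes = Counter(name for name, _ in nearest)
    # most_common() sorts by count, highest first
    return [name for name, _ in votes.most_common(limit)]

ranking = [('Eat', 34), ('Sleep', 54), ('Eat', 76), ('Watch a movie', 93)]
print(top_events(ranking, k=3))   # ['Eat', 'Sleep']
```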

Choosing a good k is important. If k is too small, accidental outliers (doing something once at a specific moment) have a large influence on the result. If k is too large, unrelated events get included in the count. One way to alleviate this is by using distance weighted kNN (see below).

Choosing a distance function

As mentioned in the comments, using a simple distance between two timestamps might lose you some information such as day-of-week. We can solve this by making the distance function d(e1, e2) slightly more complex. In this case, we can choose it to be a tradeoff between time-of-day and day-of-week, e.g.

d(e1, e2) = a * |timeOfDay(e1) - timeOfDay(e2)| * (1/1440) + 
            b * |dayOfWeek(e1) - dayOfWeek(e2)| * (1/7)

where we normalised both differences by the maximum difference possible between times in a day (in minutes) and days of the week. a and b are parameters that can be used to give more weight to one of these differences. For example, if we choose a = 3 and b = 1, we say that occurring at the same time of day is three times more important than occurring on the same day of the week.
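As a rough sketch, the distance function could look like this in Python. It assumes both arguments are `datetime` objects (the names `time_of_day` and `distance` are chosen here for illustration), and it follows the formula literally, so it does not wrap around midnight or around the end of the week: 23:50 and 00:10 count as far apart, and so do Sunday and Monday.

```python
from datetime import datetime

MINUTES_PER_DAY = 24 * 60   # 1440, the normaliser used in the formula
DAYS_PER_WEEK = 7

def time_of_day(dt):
    """Minutes elapsed since midnight."""
    return dt.hour * 60 + dt.minute

def distance(e1, e2, a=1.0, b=1.0):
    """Weighted sum of the normalised time-of-day and day-of-week gaps."""
    time_gap = abs(time_of_day(e1) - time_of_day(e2)) / MINUTES_PER_DAY
    day_gap = abs(e1.weekday() - e2.weekday()) / DAYS_PER_WEEK  # Monday is 0
    return a * time_gap + b * day_gap

friday_9pm = datetime(2017, 3, 17, 21, 0)
earlier_friday = datetime(2017, 3, 10, 20, 30)   # same weekday, 30 min apart
monday_9pm = datetime(2017, 3, 13, 21, 0)        # same time, 4 weekdays apart

print(round(distance(friday_9pm, earlier_friday), 4))   # 0.0208
print(round(distance(friday_9pm, monday_9pm), 4))       # 0.5714
```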

Distance weighted kNN

You can increase complexity (and hopefully performance) by not simply selecting the k closest elements but by assigning every entry in the database a weight that depends on its distance to the given point. Let e be the input example and o be an example from the database. Then we compute o's weight with respect to e as

          1
w_o = ---------
      d(e, o)^2

We see that a point's weight drops off quickly as its distance to e increases. To get the final ranking of event types, sum the weights of all entries that belong to the same event, then select the top few events from that ranking.
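A minimal sketch of this weighting could look as follows. The function name `weighted_ranking`, the `(event_name, example)` layout of `db`, and the small `eps` term are assumptions made for this illustration; `eps` is not part of the formula above and is only there to avoid dividing by zero when an entry sits at distance 0.

```python
from collections import defaultdict

def weighted_ranking(e, db, d, eps=1e-9):
    """Sum 1 / d(e, o)^2 per event and sort events by total weight.

    db: iterable of (event_name, example) pairs (hypothetical layout).
    d:  distance function taking the input example and a db example.
    """
    totals = defaultdict(float)
    for name, o in db:
        totals[name] += 1.0 / (d(e, o) ** 2 + eps)   # eps guards d == 0
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Toy data: examples are plain minute offsets, distance is their gap.
db = [('Eat', 34), ('Sleep', 54), ('Eat', 76), ('Watch a movie', 93)]
result = weighted_ranking(0, db, d=lambda e, o: abs(e - o))
print([name for name, _ in result])   # ['Eat', 'Sleep', 'Watch a movie']
```

Because every entry contributes, there is no k to tune here; distant entries simply add very little weight.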

Implementation

The nice thing about kNN is that it is very easy to implement. You'll roughly need the following components.

  • An implementation of the distance function d(e1, e2).
  • A function that ranks all elements in the database according to this function and a given input example.

    def rank(e, db, d):
        """ Rank the examples in db with respect to e using
            distance function d.
        """
        return sorted([(o, d(e, o)) for o in db],
                      key=lambda x: x[1])
    
  • A function that selects some elements from this ranking.
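Putting the components together, something close to the `get_events` function from the question could be sketched like this. It is a sketch under several assumptions: entries are dicts with an `event_id` and a `date` that is already a `datetime` object, the history has already been filtered to one user, event IDs 2 and 3 are invented for the sample data, and the name `suggest_events` with its `k` and `limit` parameters is hypothetical.

```python
from collections import Counter
from datetime import datetime

def distance(e1, e2, a=1.0, b=1.0):
    """Normalised time-of-day gap plus normalised day-of-week gap."""
    minutes1 = e1.hour * 60 + e1.minute
    minutes2 = e2.hour * 60 + e2.minute
    return (a * abs(minutes1 - minutes2) / 1440
            + b * abs(e1.weekday() - e2.weekday()) / 7)

def suggest_events(when, entries, k=5, limit=3):
    """Return up to `limit` event IDs, voted on by the k nearest entries."""
    ranked = sorted(entries, key=lambda entry: distance(when, entry['date']))
    votes = Counter(entry['event_id'] for entry in ranked[:k])
    return [event_id for event_id, _ in votes.most_common(limit)]

# Sample history for a single user (IDs: 1 = Eat, 2 = Sleep, 3 = Watch a movie).
history = [
    {'event_id': 1, 'date': datetime(2017, 3, 10, 12, 30)},  # Friday
    {'event_id': 3, 'date': datetime(2017, 3, 10, 21, 15)},  # Friday
    {'event_id': 3, 'date': datetime(2017, 3, 3, 20, 45)},   # Friday
    {'event_id': 2, 'date': datetime(2017, 3, 13, 22, 0)},   # Monday
    {'event_id': 1, 'date': datetime(2017, 3, 13, 12, 10)},  # Monday
]

print(suggest_events(datetime(2017, 3, 17, 21, 0), history, k=3))  # [3, 1]
print(suggest_events(datetime(2017, 3, 20, 12, 0), history, k=3))  # [1, 2]
```

In a Django application the history would come from a query filtered by user, and sorting the full history on every request may become slow for long histories; that trade-off is not addressed here.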

Upvotes: 3
