w00dy

Reputation: 758

Scikit-learn labeled dataset creation from segmented time series

INTRO

I have a Pandas DataFrame that represents a segmented time series of different users (i.e., user1 and user2). I want to train a scikit-learn classifier on this DataFrame, but I can't work out the shape of the scikit-learn dataset that I must create. Since my series are segmented, my DataFrame has a 'segID' column that holds the ID of the segment each row belongs to. I'll skip the description of the segmentation itself since it is provided by an algorithm.

Let's take an example where both user1 and user2 have 2 segments: print df

        username  voltage        segID  
0       user1     -0.154732      0  
1       user1     -0.063169      0  
2       user1      0.554732      1  
3       user1     -0.641311      1  
4       user1     -0.653732      1  
5       user2      0.446469      0  
6       user2     -0.655732      0  
7       user2      0.646769      0  
8       user2     -0.646369      1  
9       user2      0.257732      1  
10      user2     -0.346369      1
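For reference, a minimal sketch that reproduces the example DataFrame above (the values are simply the ones printed):

import pandas as pd

# reproduce the example DataFrame shown above
df = pd.DataFrame({
    'username': ['user1'] * 5 + ['user2'] * 6,
    'voltage': [-0.154732, -0.063169, 0.554732, -0.641311, -0.653732,
                0.446469, -0.655732, 0.646769, -0.646369, 0.257732, -0.346369],
    'segID': [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1],
})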

QUESTIONS:

scikit-learn dataset API says to create a dict containing data and target, but how can I shape my data since they are segments and not just a list?

I can't figure out how my segments fit into the n_samples * n_features structure. I have two ideas:

1) Every data sample is a list representing one segment; the target is then different for each data entry, since entries are grouped by segment. What about target_names? Could this work?

{
    'data': array([
        [-0.154732, -0.063169],
        [ 0.554732, -0.641311, -0.653732],
        [ 0.446469, -0.655732,  0.646769],
        [-0.646369,  0.257732, -0.346369]
        ]),
    'target':
        array([0, 1, 2, 3]),
    'target_names': array(['user1seg1', 'user1seg2', 'user2seg1', 'user2seg2'], dtype='|S10')
}

2) data is (simply) the ndarray returned by df.values; target contains segment IDs that are made different for each user.... Does that make sense?

{
    'data': array([
        [-0.154732],
        [-0.063169],
        [ 0.554732],
        [-0.641311],
        [-0.653732],
        [ 0.446469],
        [-0.655732],
        [ 0.646769],
        [-0.646369],
        [ 0.257732],
        [-0.346369]
        ]), 
    'target': 
        array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]),
    'target_names': array(['user1seg1', 'user1seg1', 'user1seg2', 'user1seg2', .....], dtype='|S10')
}
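For what it's worth, a hedged sketch of how idea 2 could be assembled from the DataFrame (the combined user+segment label is just my own construction, nothing scikit-learn requires):

import pandas as pd

# idea 2 sketch: one row per sample, one combined (user, segment) code per row
X = df[['voltage']].values                               # shape (n_samples, 1)
labels = df['username'] + 'seg' + (df['segID'] + 1).astype(str)
y, target_names = pd.factorize(labels)                   # integer codes plus their names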

I think the main problem is that I can't figure out what to use as labels...

EDIT:

OK, it's clear... the labels are given by my ground truth; they are just the users' names. elyase's answer is exactly what I was looking for. To better state the problem, I'll explain here what segID means. In time series pattern recognition, segmenting can be useful to isolate meaningful segments. At testing time I want to recognize segments and not the entire series, because the series is rather long and the segments are supposed to be meaningful in my context.

Have a look at the following example from this implementation, based on "An Online Algorithm for Segmenting Time Series". My segID is just a column holding the ID of each chunk.

[figure: segmented time series]

Upvotes: 4

Views: 1763

Answers (1)

elyase

Reputation: 40993

This is not trivial, and there may be several ways of formulating the problem for consumption by an ML algorithm. You should try them all and see which gives the best results.

As you already found, you need two things: a matrix X of shape n_samples * n_features and a column vector y of length n_samples. Let's start with the target y.

Target:

As you want to predict a user from a discrete pool of usernames, you have a classification problem, and your target will be a vector with np.unique(y) == ['user1', 'user2', ...].

Features

Your features are the information that you provide to the ML algorithm for each label/user/target. Unfortunately, most algorithms require this information to have a fixed length, but variable-length time series don't fit well into this description. So if you want to stick to classic algorithms, you need some way to condense the time series information for a user into a fixed-length vector. Some possibilities are the mean, min, max, sum, first and last values, a histogram, spectral power, etc. You will need to come up with the ones that make sense for your given problem.

So if you ignore the SegID information your X matrix will look like this:

y / features   min   max  ...  sum
user1          0.1   1.2  ...  1.1     # <- first time series for user 1
user1          0.0   1.3  ...  1.1     # <- second time series for user 1
user2          0.3   0.4  ...  13.0    # <- first time series for user 2
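A possible way to build such a matrix with pandas is sketched below; the aggregation choices and the grouping (one sample per (username, segID) chunk) are assumptions, not a prescribed recipe:

import pandas as pd

# condense each (username, segID) chunk into a fixed-length feature vector
features = (df.groupby(['username', 'segID'])['voltage']
              .agg(['min', 'max', 'mean', 'sum'])
              .reset_index())

X = features[['min', 'max', 'mean', 'sum']].values   # n_samples x n_features
y = features['username'].values                      # one username label per chunk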

As segID is itself a time series, you also need to encode it as fixed-length information, for example a histogram/counts of all possible values, the most frequent value, etc.

In this case you will have:

y / features   min   max  ...  sum   segID_most_freq   segID_min
user1          0.1   1.2  ...  1.1   1                 1
user1          0.3   0.4  ...  13    2                 1
user2          0.3   0.4  ...  13    5                 3
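One way this could be sketched in pandas, assuming each user's rows in df form a single series (so there is one summary row per user here):

# summarise the segID column of each user's series with fixed-length statistics
seg_stats = (df.groupby('username')['segID']
               .agg(segID_most_freq=lambda s: s.mode().iloc[0],
                    segID_min='min')
               .reset_index())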

The algorithm will look at this data and "think": for user1 the minimum segID is always 1, so if at prediction time I see a user whose time series has a minimum segID of 1, then it should be user1. If it is around 3, it is probably user2, and so on.

Keep in mind that this is only one possible approach. Sometimes it is useful to ask: what info will I have at prediction time that will allow me to find which user I am seeing, and why will this info lead to the given user?
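To close the loop, a minimal hedged example of feeding such a feature matrix to a scikit-learn classifier (RandomForestClassifier is an arbitrary choice here):

from sklearn.ensemble import RandomForestClassifier

# fit any scikit-learn classifier on the fixed-length features
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)                       # X: n_samples x n_features, y: usernames
predicted_users = clf.predict(X)    # at prediction time, pass features of new segments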

Upvotes: 2
