KOB

Reputation: 4545

How to format a data set for time series prediction in H2O's Driverless AI

For simplicity, say that I am trying to predict the next day's value in a sequence of single-valued variables, so my dataset would have the form:

input    label
   x1       x2
   x2       x3
   x3       x4
  ...      ...
   xt      xt+1

However, my data has the same sequences in time for many different users, so it actually has the following form:

input    label
 u1x1     u1x2
 u1x2     u1x3
 u1x3     u1x4
  ...      ...
 u1xt   u1xt+1
 u2x1     u2x2
 u2x2     u2x3
 u2x3     u2x4
  ...      ...
 u2xt   u2xt+1
  ...      ...
 unx1     unx2
 unx2     unx3
 unx3     unx4
  ...      ...
 unxt   unxt+1

What is an acceptable way to structure this data and feed it into DAI so that it is treated not as one long sequence, but as a set of not directly related sequences that run in parallel in time?

Edit: The data has a 'UserID' column. Can DAI automatically use this to overcome the problem I am explaining?

Upvotes: 1

Views: 165

Answers (1)

Lauren

Reputation: 5778

To format your data for forecasting, you need to aggregate your data for each group of interest and for a specific time period (in your case one day).

So if your forecast horizon is one day, you need to aggregate your single-valued variable by user and by day, so that the target (label) is a total amount per user per day. You can find documentation on how to set up your data for Driverless AI here and here.

EDIT in response to comment:

Here is another example to explain the expected data format using the assumption that each user should be aggregated at the day level:

If you have one day's worth of data for 5 users, your dataset should only have 5 rows, but if you have 10 days' worth of data for 5 users, you should have 50 rows of data.

Then, when you set up your experiment in Driverless AI, you would set your Time Group to the User column.
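As a rough illustration of the aggregation step described above, here is a minimal sketch in plain Python. It assumes the raw data is a list of (user ID, timestamp, value) events; the `aggregate_daily` helper and the `Date` and `Target` column names are made up for this example (only `UserID` comes from the question), and summing is just one possible way to aggregate. It is not part of Driverless AI itself, only a way to get one row per user per day before uploading the data.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def aggregate_daily(events):
    """Sum raw (user_id, timestamp, value) events into one row per user per day."""
    totals = defaultdict(float)
    for user_id, timestamp, value in events:
        totals[(user_id, timestamp.date())] += value
    # Sort by user, then by date, so each user's series is in time order.
    return [
        {"UserID": user, "Date": day.isoformat(), "Target": total}
        for (user, day), total in sorted(totals.items())
    ]


# Hypothetical raw data: 5 users, 10 days, 2 events per user per day.
start = datetime(2020, 1, 1, 9, 0)
events = [
    (f"u{u}", start + timedelta(days=d, hours=h), 1.0)
    for u in range(1, 6)
    for d in range(10)
    for h in (0, 6)
]

rows = aggregate_daily(events)
print(len(rows))  # 50 rows: 5 users x 10 days
print(rows[0])
```

Each resulting row carries the user, the day, and the daily total, so every user's sequence stays separate as long as the user column is used for grouping.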

Upvotes: 1
