Shuffling time series classification data

Question

I am working on a time series binary classification problem. Each row represents a single person (I have N consumers), and the columns are daily measurements of a single variable for that person (K measurements per consumer). I need to detect whether someone has committed fraud or not (FLAG 1 or 0). A small example:

import pandas as pd

data = {'CONS_NO': [1,2,3,'N'], 'Day_1': [1, 2, 3, 4], 'Day_2': [200, 321, 0, 128], 'Day_K': [123, 0, 3, 1], 'FLAG':[1,1,0,0]}

# Create DataFrame
df = pd.DataFrame(data)
df

  CONS_NO  Day_1  Day_2  Day_K  FLAG
0       1      1    200    123     1
1       2      2    321      0     1
2       3      3      0      3     0
3       N      4    128      1     0

The way my dataset is now, the first 3000 rows are made up of consumers who committed fraud, while the rest of the rows are made up of honest consumers.

I have seen that I shouldn't shuffle my columns and that I need to use something like TimeSeriesSplit() to split my train/test sets. But is it OK to shuffle the rows in my dataframe? Or, to be more precise, do I really need to do this? Will it help with training my model?

RK1 · Accepted Answer

You generally want to shuffle your data in order to ensure your train and test sets are representative of the overall (data) distribution.

Shuffling matters if you are going to split the data into train and test sets, or if you train in batches, for example with mini-batch SGD. If you use a simple learning algorithm, for example one where the MLE can be computed on the full dataset in memory, and the dataset is used only for training, then shuffling is not necessary.
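To make the point about splitting concrete, here is a minimal sketch on hypothetical toy data (the frame `df_demo`, its sizes, and its random values are assumptions for illustration, not the asker's dataset). It mimics the sorted layout from the question, with all fraud rows first, and compares a split with and without shuffling:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 30 fraud rows first, then 70 honest rows,
# mimicking the sorted layout described in the question.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    rng.integers(0, 500, size=(100, 5)),
    columns=[f"Day_{i}" for i in range(1, 6)],
)
df_demo["FLAG"] = [1] * 30 + [0] * 70

X_demo = df_demo.drop(columns="FLAG")
y_demo = df_demo["FLAG"]

# Without shuffling, the last 20% of the rows become the test set.
_, _, _, y_test_sorted = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

# With shuffling, test rows are drawn from the whole frame.
_, _, _, y_test_shuffled = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=True, random_state=0
)

print("unshuffled test labels:", y_test_sorted.value_counts().to_dict())
print("shuffled test labels:  ", y_test_shuffled.value_counts().to_dict())
```

In the unshuffled case the test set contains only honest consumers, so it cannot tell you anything about how well fraud is detected.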

To shuffle your data:

df = df.sample(frac=1).reset_index(drop=True)

Or you could use sklearn to shuffle & split the data:

from sklearn.model_selection import train_test_split
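# Keep only the daily measurements: drop the consumer ID (an identifier, not a feature) and the label
X = df.drop(columns=['CONS_NO', 'FLAG'])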
y = df['FLAG']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
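If the two classes are imbalanced, a plain shuffle can still leave the train and test sets with slightly different fraud rates. One possible refinement, which goes beyond the snippet above, is the `stratify` argument of `train_test_split`. The sketch below uses a hypothetical toy frame (`toy`, with made-up values) rather than the real data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 30 fraud rows followed by 70 honest rows.
toy = pd.DataFrame({
    "CONS_NO": range(100),
    "Day_1": range(100),
    "Day_2": range(100, 200),
    "FLAG": [1] * 30 + [0] * 70,
})

# Use only the daily measurements as features.
features = toy.drop(columns=["CONS_NO", "FLAG"])
labels = toy["FLAG"]

# stratify=labels asks sklearn to preserve the class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42, shuffle=True, stratify=labels
)

print("fraud rate in train:", y_tr.mean())
print("fraud rate in test: ", y_te.mean())
```

Whether stratification is worth it depends on how imbalanced the real dataset is; with only a small share of fraud cases it tends to give more stable evaluation numbers.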

This post has a good discussion on the importance of shuffling.

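This assumes that the order of the rows is not important and that the samples are i.i.d., which should be the case if you're using traditional supervised learning algorithms, e.g. logistic regression. Note that shuffling rows only reorders the consumers; it leaves the order of the daily measurements within each row untouched.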
