Reputation: 679
I am working on a time series binary classification problem. Each row represents a single person (I have N consumers), and the columns are daily measurements of a single variable for that person (K measurements per consumer). I need to detect whether someone has committed fraud or not (FLAG 1 or 0). A small example:
import pandas as pd

data = {'CONS_NO': [1,2,3,'N'], 'Day_1': [1, 2, 3, 4], 'Day_2': [200, 321, 0, 128], 'Day_K': [123, 0, 3, 1], 'FLAG':[1,1,0,0]}
# Create DataFrame
df = pd.DataFrame(data)
df
  CONS_NO  Day_1  Day_2  Day_K  FLAG
0       1      1    200    123     1
1       2      2    321      0     1
2       3      3      0      3     0
3       N      4    128      1     0
The way my dataset is now, the first 3000 rows are made up of consumers who committed fraud, while the rest of the rows are made up of honest consumers.
I have seen that I shouldn't shuffle my columns and that I need to use something like TimeSeriesSplit() to split my train/test sets. But is it OK to shuffle the rows in my dataframe? Or, to be more precise, do I really need to do this? Will it help with training my model?
Upvotes: 0
Views: 1108
Reputation: 2532
You generally want to shuffle your data in order to ensure your train and test sets are representative of the overall (data) distribution.
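Shuffling data is important if you are going to split the data between train and test, or if you're doing batch training, for example batch SGD. In your case the rows are sorted by class (fraud first), so a split that simply cuts off the last rows as the test set would likely contain only honest consumers. If it's a simple learning algorithm, for example one where the MLE can be computed on the full dataset in memory, and the dataset is used only for training, then shuffling is not necessary.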
To shuffle your data:
df = df.sample(frac=1).reset_index(drop=True)
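To make the effect concrete, here is a minimal sketch on a toy frame laid out like the dataset in the question (fraud rows first, honest rows after). The sizes, the values and the `tail_split` helper are made up for illustration and are not from the question. It shows that an unshuffled tail split ends up with a single class in the test set, and that shuffling rows leaves the column (day) order untouched.

```python
import pandas as pd

# Toy stand-in for the question's layout: 3 fraud rows first, then 7 honest rows.
# Sizes and values are hypothetical.
df = pd.DataFrame({
    "Day_1": range(10),
    "Day_2": range(10, 20),
    "FLAG": [1] * 3 + [0] * 7,
})

def tail_split(frame, n_test):
    """Hypothetical helper: use the last n_test rows as the test set, no shuffling."""
    return frame.iloc[:-n_test], frame.iloc[-n_test:]

# Without shuffling, the test set is cut from the sorted tail.
_, test_sorted = tail_split(df, n_test=3)
classes_sorted = set(test_sorted["FLAG"].tolist())
print("classes in unshuffled test set:", classes_sorted)

# Shuffle the rows only; each row keeps its Day_1..Day_K values in column order.
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
_, test_shuffled = tail_split(shuffled, n_test=3)
print("classes in shuffled test set:", set(test_shuffled["FLAG"].tolist()))
print("column order unchanged:", list(shuffled.columns) == list(df.columns))
```

With so few rows a shuffled test set can still miss a class by chance, which is one reason to prefer a splitter that controls the class balance explicitly.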
Or you could use sklearn to shuffle and split the data:
from sklearn.model_selection import train_test_split
X = df.drop(columns=['CONS_NO', 'FLAG'])  # features only: drop the consumer ID and the label
y = df['FLAG']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, shuffle=True)
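Since the classes in a fraud dataset are often imbalanced, one option worth considering is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both splits. This is an addition to the call above, not part of it, and the toy frame below (20 fraud rows followed by 80 honest rows) is made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame sorted by class, as in the question: 20 fraud rows, then 80 honest rows.
# The values are hypothetical.
df = pd.DataFrame({
    "Day_1": range(100),
    "Day_2": range(100, 200),
    "FLAG": [1] * 20 + [0] * 80,
})

X = df.drop(columns=["FLAG"])
y = df["FLAG"]

# shuffle=True mixes the rows; stratify=y keeps the fraud share equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
)

print("fraud share in train:", y_train.mean())
print("fraud share in test: ", y_test.mean())
```

Both splits should end up with the same fraud share as the full frame, regardless of how the rows were ordered originally.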
This post has a good discussion on the importance of shuffling.
This assumes that the order of the rows is not important and that the samples are i.i.d., which should be the case if you're using traditional supervised learning algorithms, e.g. logistic regression.
Upvotes: 1