Kevin

Reputation: 3239

Python - Split data into n stratified parts

I have a dataset of a few thousand samples (X and y) that I want to split into n equal parts, and then split each part into train/test. From what I understand, sklearn's stratified k-fold is almost what I want, but it does not split each chunk into train/test.

Is there another function that can do this for me?


Upvotes: 2

Views: 2603

Answers (2)

Kevin

Reputation: 3239

This worked for me:

from random import shuffle

import numpy as np

n_splits = 10
n_classes = 2

# Collect each class's sample ids into its own shuffled list
class_split_list = {}
for i in range(n_classes):
    class_list = list(set(data.iloc[data.groupby(['normal']).groups[i]].sample_id.tolist()))
    shuffle(class_list)
    class_split_list[i] = np.array_split(class_list, n_splits)  # dict of per-class chunks

# Build each stratified chunk by taking one split from every class
stratified_sample_chunks = []
for i in range(n_splits):
    class_chunks = []
    for j in range(n_classes):
        class_chunks.extend(class_split_list[j][i])  # take split i from class j
    stratified_sample_chunks.append(class_chunks)

print(stratified_sample_chunks[0][:20])

You can replace class_list = list(set(data.iloc[data.groupby(['normal']).groups[i]].sample_id.tolist())) with class_list = list(set(data.iloc[data.groupby(['Column_with_y_values']).groups[i]].index.tolist())) to use the DataFrame index instead of a sample_id column.
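If you'd rather lean on sklearn directly, the test folds of StratifiedKFold already form n stratified, near-equal chunks, and each chunk can then be split into train/test. A minimal sketch with made-up toy data (X, y, and the sizes here are stand-ins, not from the question):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# toy data: 100 samples, two balanced classes (stand-in for the real X and y)
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

n_splits = 10
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)

# the test indices of the n folds form n stratified, near-equal chunks
chunks = [test_idx for _, test_idx in skf.split(X, y)]

# then split each chunk into train/test, keeping the class balance
for idx in chunks:
    X_train, X_test, y_train, y_test = train_test_split(
        X[idx], y[idx], test_size=0.25, stratify=y[idx], random_state=0
    )
```

With 100 balanced samples and 10 folds, every chunk ends up with 10 samples, 5 per class, and the stratify argument keeps that balance inside each chunk's train/test split.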

Upvotes: 1

import math

from sklearn.model_selection import train_test_split

n = 10
chunk_size = math.ceil(df.shape[0] / n)  # ceil avoids an oversized/empty last chunk when n divides evenly

for i in range(n):
    start = chunk_size * i
    data = df.iloc[start: start + chunk_size]
    X_data = data.drop(['target'], axis=1)
    y_data = data['target']
    # stratify keeps the class balance within each chunk's train/test split
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, stratify=y_data)
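Note that positional chunking alone does not guarantee class balance inside each chunk; shuffling the frame first makes the chunks roughly stratified. A small self-contained sketch (df, feat, and target are made-up toy names, not from the question):

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# toy stand-in for the answer's df, with the assumed 'target' column
df = pd.DataFrame({"feat": np.arange(100), "target": [0, 1] * 50})

# shuffle rows first so each positional chunk is roughly class-balanced
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

n = 10
chunk_size = math.ceil(df.shape[0] / n)  # 10 rows per chunk here

splits = []
for i in range(n):
    chunk = df.iloc[chunk_size * i : chunk_size * (i + 1)]
    # train/test split within the chunk (add stratify=chunk['target']
    # once every chunk is guaranteed to contain both classes)
    X_train, X_test, y_train, y_test = train_test_split(
        chunk.drop(columns="target"), chunk["target"], random_state=0
    )
    splits.append((X_train, X_test, y_train, y_test))
```

Shuffling only makes the chunks approximately balanced; for a guaranteed per-class split, the per-class chunking in the accepted answer is the safer route.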

Upvotes: 0
