Reputation: 3239
I have a dataset of a few thousand samples (X and y) that I want to split into n equal parts, and then split each of those parts into train/test. From what I understand, stratified k-fold from sklearn is almost what I want, but it does not split each chunk into train/test.
Is there another function that can do this for me?
Upvotes: 2
Views: 2603
Reputation: 3239
This worked for me (data is my pandas DataFrame, with the class label in the 'normal' column):

from random import shuffle

import numpy as np

n_splits = 10
n_classes = 2

# Get each class into its own list of sample ids
class_split_list = {}
for i in range(n_classes):
    class_list = list(set(data.iloc[data.groupby(['normal']).groups[i]].sample_id.tolist()))
    shuffle(class_list)
    class_split_list[i] = np.array_split(class_list, n_splits)  # dict of per-class chunks

# Take one chunk from each class per split, so every chunk is stratified
stratified_sample_chunks = []
for i in range(n_splits):
    class_chunks = []
    for j in range(n_classes):
        class_chunks.extend(class_split_list[j][i])  # get split from current class
    stratified_sample_chunks.append(class_chunks)

print(stratified_sample_chunks[0][:20])
If your label column is named differently, change

class_list = list(set(data.iloc[data.groupby(['normal']).groups[i]].sample_id.tolist()))

to

class_list = list(set(data.iloc[data.groupby(['Column_with_y_values']).groups[i]].index.tolist()))
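For reference, sklearn can produce the same kind of class-balanced chunks without the manual bookkeeping: each fold's test indices from StratifiedKFold form one stratified chunk, and the n_splits test sets together partition the whole dataset. A minimal sketch on toy data (the X, y, and sizes here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy stand-ins for the question's X and y: 60 samples of class 0, 40 of class 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 60 + [1] * 40)

# Each fold's *test* index array is one class-balanced chunk;
# the 10 test sets together cover every row exactly once.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
chunks = [test_idx for _, test_idx in skf.split(X, y)]

print(len(chunks))                  # 10 chunks
print(np.bincount(y[chunks[0]]))    # [6 4] -> 60/40 class ratio preserved per chunk
```

Each chunk can then be passed to train_test_split on its own if a train/test split per chunk is needed.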
Upvotes: 1
Reputation: 599
from sklearn.model_selection import train_test_split

n = 10
chunk_size = int(df.shape[0] / n) + 1  # rounded up so the n chunks cover every row

for i in range(n):
    start = chunk_size * i
    data = df.iloc[start: start + chunk_size]
    X_data = data.drop(['target'], axis=1)
    y_data = data['target']
    X_train, X_test, y_train, y_test = train_test_split(X_data, y_data)
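Note that chunking rows in order and splitting each chunk this way does not preserve class balance. train_test_split accepts a stratify argument that keeps the class ratio in both halves of each chunk; a sketch on toy data (the df and 'target' column from the answer aren't shown, so made-up arrays stand in):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins: 80 samples of class 0, 20 of class 1
X_data = np.arange(100).reshape(-1, 1)
y_data = np.array([0] * 80 + [1] * 20)

# stratify=y_data keeps the 80/20 class ratio in both train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.25, stratify=y_data, random_state=0)

print(np.bincount(y_test))  # [20  5] -> same 80/20 ratio in the test half
```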
Upvotes: 0