Abigail

Reputation: 41

How to divide dataset into test, validation and training using specific indices?

I have a large dataset of about 189,000 rows and 16 columns. I want to split it into 80% training and 20% testing. The rows are grouped by topic: the first block of rows is news, the second is sports, the third is religion, and the last is general. I can't simply take the first 80% of the rows for training, because the classes in the lower part of the dataset would then be missing from the training set. Also, how can I select a validation set from such a dataset?

Upvotes: 0

Views: 592

Answers (1)

beesleep

Reputation: 1362

If I understood your question correctly, you want to preserve the proportions of the different row types when selecting your training set. I suggest selecting 80% of the rows of each row type.

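% rowType: label of each row (1: news, 2: sport, 3: religion, 4: general)
% dataset: original dataset variable
trainingSelected = false(size(dataset,1),1);   % logical mask, true = training row
p = 0.8;                                       % fraction of each row type used for training
for i = 1:4
    rTypeIdx = find(rowType == i);             % row indices of the current type
    n = numel(rTypeIdx);

    sel = randperm(n, round(n*p));             % random subset of this type's rows
    trainingSelected(rTypeIdx(sel)) = true;
end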
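The question also asks about a validation set, which the loop above does not cover. One possible way to extend the same per-type idea is sketched below. It is only a sketch: the stand-in data, the 60/20/20 fractions, and the names pTrain, pVal and splitLabel are assumptions made for illustration, not part of the original code.

```matlab
% Hypothetical stand-in data: 1000 rows, 16 columns, rows grouped by type
rowType = [ones(400,1); 2*ones(300,1); 3*ones(200,1); 4*ones(100,1)];
dataset = rand(numel(rowType), 16);

pTrain = 0.6;                             % assumed training fraction
pVal   = 0.2;                             % assumed validation fraction; the rest is testing
splitLabel = zeros(numel(rowType), 1);    % 1 = training, 2 = validation, 3 = testing

for i = 1:4
    rTypeIdx = find(rowType == i);        % row indices of the current type
    n = numel(rTypeIdx);
    shuffled = rTypeIdx(randperm(n));     % random order within this row type

    nTrain = round(n * pTrain);
    nVal   = round(n * pVal);
    splitLabel(shuffled(1:nTrain)) = 1;
    splitLabel(shuffled(nTrain+1:nTrain+nVal)) = 2;
    splitLabel(shuffled(nTrain+nVal+1:end)) = 3;
end

% Apply the labels to extract the three subsets
trainingSet   = dataset(splitLabel == 1, :);
validationSet = dataset(splitLabel == 2, :);
testingSet    = dataset(splitLabel == 3, :);
```

Shuffling each type once and cutting the shuffled list into three consecutive pieces guarantees that no row lands in more than one subset.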

If you don't want to strictly keep the proportions, you can just use randperm directly:

p = 0.8;
nRows = size(dataset,1);
trainingSelected = false(nRows,1);                          % logical mask, as above
trainingSelected(randperm(nRows, round(nRows*p))) = true;   % random 80% of all rows
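To see what not strictly keeping the proportions means in practice, here is a small sketch that draws a plain random 80% selection, reports the fraction of each row type that ended up in training, and then extracts the two subsets. The stand-in data and the name trainFraction are assumptions made for illustration.

```matlab
% Hypothetical stand-in data: 1000 rows, 16 columns, rows grouped by type
rowType = [ones(400,1); 2*ones(300,1); 3*ones(200,1); 4*ones(100,1)];
dataset = rand(numel(rowType), 16);
nRows = numel(rowType);
p = 0.8;

% Plain random selection over all rows, stored as a logical mask
trainingSelected = false(nRows, 1);
trainingSelected(randperm(nRows, round(nRows * p))) = true;

% Fraction of each row type that landed in the training set;
% the values scatter around p instead of matching it exactly
trainFraction = accumarray(rowType, double(trainingSelected)) ./ accumarray(rowType, 1);
disp(trainFraction');

% Rows not selected for training form the test set
trainingSet = dataset(trainingSelected, :);
testingSet  = dataset(~trainingSelected, :);
```

With large groups the fractions usually stay close to 0.8, so the simpler selection is often good enough; the per-type loop matters most when some groups are small.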

Upvotes: 2
