Reputation: 41
I have a large dataset of about 189000 rows and 16 columns. I want to split it into 80% training and 20% testing. The rows are ordered by group: the first portion relates to news, the second to sports, the third to religion, and the last to general topics. I can't simply take the first 80% of the rows, because most of the classes that lie in the lower part of the dataset would be missing from the training set. Also, how can I select a validation set from such a dataset?
Upvotes: 0
Views: 592
Reputation: 1362
If I understood your question correctly, you want the training set to preserve the proportions of the different row types. I suggest selecting 80% of the rows of each row type, using a logical mask that marks the training rows:
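% rowType: column vector with one label per row (1: news, 2: sport, 3: religion, 4: general)
% dataset: original dataset variable
trainingSelected = false(size(dataset,1),1);   % true marks a training row
p = 0.8;                                       % fraction of each type used for training
for i = 1:4
    rTypeIdx = find(rowType==i);               % rows belonging to type i
    n = numel(rTypeIdx);
    sel = randperm(n, round(n*p));             % random 80% of those rows, without repetition
    trainingSelected(rTypeIdx(sel)) = true;
end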
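As a minimal sketch of how the mask might then be used: rows flagged true go to the training set and all others to the test set. The names `trainSet` and `testSet` are illustrative, and the toy `dataset` and `rowType` below are assumptions made only so the snippet runs on its own.

```matlab
% Toy stand-ins (assumptions): 20 rows, 16 columns, 4 row types of 5 rows each
dataset = rand(20, 16);
rowType = repelem((1:4)', 5);

% Stratified 80% selection, same logic as above
trainingSelected = false(size(dataset,1),1);
p = 0.8;
for i = 1:4
    rTypeIdx = find(rowType==i);
    n = numel(rTypeIdx);
    sel = randperm(n, round(n*p));
    trainingSelected(rTypeIdx(sel)) = true;
end

% Rows flagged true form the training set; the remaining rows form the test set
trainSet = dataset(trainingSelected, :);
testSet  = dataset(~trainingSelected, :);
```

Because the selection is done per row type, each type contributes the same share of its rows to both sets.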
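If you don't need to keep the proportions exactly, you can use randperm directly on all rows. With this many rows, the proportions should still come out roughly the same, just not exactly:
p = 0.8;
nRows = size(dataset,1);
trainingSelected = false(nRows,1);
trainingSelected(randperm(nRows, round(nRows*p))) = true;   % logical mask, as before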
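The question also asks about a validation set. One possible extension of the per-type selection, offered as a sketch rather than a definitive recipe, is to assign every row one of three labels instead of a true/false flag. The fractions `pTrain` and `pVal`, the variable `split`, and the toy `rowType` are all assumptions chosen for illustration.

```matlab
% Toy stand-in (assumption): 4 row types with 10 rows each
rowType = repelem((1:4)', 10);

pTrain = 0.7;   % assumed fraction for training
pVal   = 0.1;   % assumed fraction for validation; the remaining 20% is test

% split(k): 1 = training, 2 = validation, 3 = test (default)
split = 3 * ones(numel(rowType), 1);
for i = 1:4
    rTypeIdx = find(rowType == i);
    n = numel(rTypeIdx);
    order = rTypeIdx(randperm(n));            % shuffle the rows of this type
    nTrain = round(n * pTrain);
    nVal   = round(n * pVal);
    split(order(1:nTrain)) = 1;               % first chunk: training
    split(order(nTrain+1:nTrain+nVal)) = 2;   % next chunk: validation
end
```

The three subsets can then be taken with `dataset(split==1,:)`, `dataset(split==2,:)` and `dataset(split==3,:)`. Shuffling each type once and cutting it into consecutive chunks keeps the subsets disjoint.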
Upvotes: 2