Reputation: 445
I don't understand what is happening here. I have a dataframe with 253680 rows.
> class(df)
[1] "data.frame"
> nrow(dataset)
[1] 253680
I want to split it into 3 parts: 50%, 25% and 25%. So I take a look at the quartiles:
> summary(c(1:nrow(dataset)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 63421 126840 126840 190260 253680
Now I'd like to access q2 and q3 to achieve the split, but:
> quartiles <- as.numeric(summary(c(1:nrow(dataset))))
> quartiles
[1] 1 63421 126841 126841 190260 253680
The values that were the median and the mean before now have 1 added to them. Why?
Upvotes: 0
Views: 150
Reputation: 79228
This is what you are looking for:
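n <- nrow(dataset)
# sizes of the first two parts: 50% and 25%, rounded down
m <- c(n %/% 2, n %/% 4)
# the last part gets whatever rows remain
l <- split(dataset, rep(1:3, c(m, n - sum(m))))
train <- l[[1]]
validation <- l[[2]]
test <- l[[3]]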
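As a quick sanity check, here is the same idea on a small made-up data frame. The `toy` data frame and the names `sizes` and `parts` are only for illustration. Note that this takes the rows in their existing order and does not shuffle them.

```r
# Hypothetical toy data standing in for `dataset`
toy <- data.frame(id = 1:10, x = (1:10)^2)

n <- nrow(toy)
m <- c(n %/% 2, n %/% 4)      # 50% and 25%, rounded down: 5 and 2
sizes <- c(m, n - sum(m))     # the remainder goes to the last part: 3

# rep() builds the group label for every row, split() cuts on it
parts <- split(toy, rep(1:3, sizes))

sapply(parts, nrow)           # part sizes are 5, 2 and 3
```

Because the last part is computed as the remainder, the three sizes always add up to `n`, even when `n` is not divisible by 4.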
Upvotes: 1
Reputation: 2141
Probably just a rounding difference, since the mean ends in .5.
I cannot reproduce this in my R version, though: for me, as.numeric shows the exact value.
n <- 253680
s <- summary(1:n)
s
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1 63421 126841 126841 190260 253680
as.numeric(s)
#> [1] 1.00 63420.75 126840.50 126840.50 190260.25 253680.00
c(s[4], as.numeric(s[4]))
#> Mean
#> 126840.5 126840.5
mean(1:n)
#> [1] 126840.5
Created on 2022-02-04 by the reprex package (v2.0.0)
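If the goal is just whole-number cut points, one option is to skip summary() and take them from quantile() directly. This is only a sketch: `cuts`, `idx`, `first`, `second` and `third` are names I made up, and I am assuming a reasonably recent R where print() on a summary accepts a `digits` argument.

```r
n <- 253680
s <- summary(1:n)
print(s, digits = 10)   # ask for more significant digits when printing

# Median and 3rd quartile of the row numbers, without names
cuts <- quantile(seq_len(n), probs = c(0.5, 0.75), names = FALSE)
idx <- floor(cuts)      # round down to whole row numbers

first  <- seq_len(idx[1])        # first 50% of the rows
second <- (idx[1] + 1):idx[2]    # next 25%
third  <- (idx[2] + 1):n         # last 25%

lengths(list(first, second, third))
```

Using floor() makes the rounding explicit, so the result does not depend on how the values happen to be displayed.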
Upvotes: 1