IDK

Reputation: 445

Row slicing R dataframe

I don't understand what is happening here. I have a dataframe with 253680 rows.

> class(dataset)
[1] "data.frame"
> nrow(dataset)
[1] 253680

I want to split it into 3 parts: 50%, 25% and 25%. So I take a look at the quartiles:

> summary(c(1:nrow(dataset)))
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1   63421  126840  126840  190260  253680

Now I'd like to access q2 and q3 to achieve the split, but:

> quartiles <- as.numeric(summary(c(1:nrow(dataset))))
> quartiles
[1]      1  63421 126841 126841 190260 253680

The values that were displayed as the median and the mean are now larger by 1. Why?

Upvotes: 0

Views: 150

Answers (2)

Onyambu

Reputation: 79228

This is what you are looking for:

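n <- nrow(dataset)
# Group sizes: 50%, 25%, and whatever remains, so the sizes always sum to n
m <- c(n %/% 2, n %/% 4)
sizes <- c(m, n - sum(m))
# Label consecutive rows 1, 2 or 3 and split the data frame by that label
l <- split(dataset, rep(1:3, sizes))
train <- l[[1]]
validation <- l[[2]]
test <- l[[3]]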
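
As a quick sanity check, the same idea can be tried on a small data frame. This is only a sketch: the toy `dataset` below, and the names `sizes`, `parts` and `shuffled`, are made up for illustration and are not from the question. It also shows one possible variant that shuffles the labels first, in case consecutive blocks of rows are not what is wanted.

```r
# Hypothetical 10-row stand-in for the real `dataset`
dataset <- data.frame(id = 1:10, x = (1:10)^2)

n <- nrow(dataset)
m <- c(n %/% 2, n %/% 4)        # sizes of the 50% and 25% parts
sizes <- c(m, n - sum(m))       # last part takes whatever rows remain

# Consecutive blocks: the first 5 rows, the next 2, the last 3
parts <- split(dataset, rep(1:3, sizes))
sapply(parts, nrow)

# Assumed variant: same group sizes, but rows assigned at random
set.seed(1)                     # arbitrary seed, only for reproducibility
shuffled <- split(dataset, sample(rep(1:3, sizes)))
sapply(shuffled, nrow)
```

Because the last size is computed as `n - sum(m)`, the three parts always cover every row exactly once, even when `n` is not divisible by 4.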

Upvotes: 1

Baraliuh

Reputation: 2141

This is probably just a rounding difference in how the values are printed, since the mean ends in .5 (126840.5). I cannot reproduce it in my version of R, though: for me, as.numeric() shows the exact values.

n <- 253680
s <- summary(1:n)
s
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       1   63421  126841  126841  190260  253680
as.numeric(s)
#> [1]      1.00  63420.75 126840.50 126840.50 190260.25 253680.00
c(s[4], as.numeric(s[4]))
#>     Mean          
#> 126840.5 126840.5
mean(1:n)
#> [1] 126840.5

Created on 2022-02-04 by the reprex package (v2.0.0)
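
If the goal is just to get cut points for the split, one option is to skip the printed summary and ask quantile() for the values directly, then round them explicitly so the row indices do not depend on how anything is displayed. The following is a rough sketch under that assumption; the names `cuts`, `idx_first`, `idx_second` and `idx_third` are hypothetical.

```r
n <- 253680                      # row count from the question

# Exact median and third quartile of the row indices (default quantile type 7)
q <- unname(quantile(seq_len(n), probs = c(0.5, 0.75)))

# Row indices must be whole numbers, so round the cut points down explicitly
cuts <- floor(q)                 # 126840 and 190260

idx_first  <- 1:cuts[1]              # first 50% of the rows
idx_second <- (cuts[1] + 1):cuts[2]  # next 25%
idx_third  <- (cuts[2] + 1):n        # last 25%

lengths(list(idx_first, idx_second, idx_third))
```

The three index vectors could then be used as `dataset[idx_first, ]` and so on.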

Upvotes: 1
