Reputation: 1
To reduce the dataset, I have been advised to use "stratified sampling".
I'm very new to R programming, and the existing posts on Stack Overflow are hard for me to follow because they include very little explanation.
I have a data set of over 60000 observations and 24 variables. Of those variables, 21 are quantitative (numeric).
How do I draw a sample from that data? Also, where do I specify the dataset name? Do I need to name the new "reduced" dataset so that I can use it in further analysis?
ADDED CODE (this is what I used for sampling):
# Sample a percentage of values from each stratum (10% in this case)
DB.quant.sample = lapply(split(DB.quant, DB.quant$group_size), function(DB.quant) {
DB.quant[sample(1:nrow(DB.quant), ceiling(nrow(DB.quant) * 0.1)), ]
})
Browse[1]>DB.quant[sample(1:nrow(DB.quant), 6000), ]
# DB.quant is the dataset and group_size is one of the variables. I'm not sure which variable I should use.
I'm having trouble illustrating graphically and intuitively how a clustering algorithm works. I started with:
DB <- na.omit(DB)
DB.quant <- DB[c(2,3,4,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24)]
And then:
d <- dist(DB.quant.sample) # but I'm getting an error:
Error in dist(DB.quant.sample, method = "euclidean") : (list) object cannot be coerced to type 'double'
Example image of my DataSet: first few rows of the data
Upvotes: 0
Views: 433
Reputation: 93761
I'm not sure exactly how you want to sample, but here's a simple example using the built-in iris
data frame. Below are two ways to do it: one using base R and the other using the dplyr
package.
In base R, split the data frame into three separate smaller data frames, one for each Species, then randomly sample 5 rows per Species.
# Sample 5 rows from each stratum
df.sample = lapply(split(iris, iris$Species), function(df) {
  df[sample(1:nrow(df), 5), ]
})
# Sample a percentage of values from each stratum (10% in this case)
df.sample = lapply(split(iris, iris$Species), function(df) {
  df[sample(1:nrow(df), ceiling(nrow(df) * 0.1)), ]
})
This gives us a list containing three data frames, one for each of the three unique values of Species.
Combine the three samples into a single data frame.
df.sample = do.call(rbind, df.sample)
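The combining step matters for the error in the question: `lapply()` returns a list, and `dist()` needs a numeric matrix or data frame, so passing it the list fails with "(list) object cannot be coerced to type 'double'". The sketch below runs the whole base R route on iris and then calls `dist()` on the combined result. The names `sample_list` and `sample_df` and the seed are illustrative choices only; for the data in the question, `DB.quant` and `DB.quant$group_size` would presumably take the places of `iris` and `iris$Species`.

```r
# Fix the random seed so the same rows are drawn on every run (illustrative value)
set.seed(42)

# Step 1: one data frame per stratum, then sample 5 rows from each
sample_list <- lapply(split(iris, iris$Species), function(df) {
  df[sample(seq_len(nrow(df)), 5), ]
})

# sample_list is still a list of data frames, which dist() cannot handle
is.list(sample_list) && !is.data.frame(sample_list)

# Step 2: stack the per-stratum samples into one data frame
sample_df <- do.call(rbind, sample_list)

# Step 3: compute distances on the numeric columns only (columns 1 to 4 in iris)
d <- dist(sample_df[, 1:4])
```

The object assigned in the last `do.call(rbind, ...)` step is the "reduced" dataset, and whatever name it is given there is the name to use in later analysis.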
With dplyr, do the grouping and sampling in a single chain of functions using the pipe (%>%) operator:
library(dplyr)
# Sample 5 values from each stratum
df.sample = iris %>%
  group_by(Species) %>%
  sample_n(5)
# Sample a percentage of values from each stratum (10% in this case)
df.sample = iris %>%
  group_by(Species) %>%
  sample_frac(0.1)
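In dplyr 1.0.0 and later, `sample_n()` and `sample_frac()` are superseded by `slice_sample()`, which covers both cases through its `n` and `prop` arguments. A minimal sketch of the same two samples, assuming a dplyr version that provides `slice_sample()`; the object names and the seed are illustrative:

```r
library(dplyr)

# Fix the random seed so the same rows are drawn on every run (illustrative value)
set.seed(42)

# Sample 5 rows from each stratum
df.sample.n <- iris %>%
  group_by(Species) %>%
  slice_sample(n = 5) %>%
  ungroup()  # drop the grouping so later steps treat this as a plain data frame

# Sample 10% of the rows from each stratum
df.sample.frac <- iris %>%
  group_by(Species) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

# Check how many rows were drawn per stratum
table(df.sample.n$Species)
```

Unlike the base R route, the result here is already a single data frame, so no `do.call(rbind, ...)` step is needed before passing the numeric columns to other functions.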
Upvotes: 2