Reputation: 55
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))
Pick a random start point within each id
and from that point, select that row and subsequent, sequential rows totaling 1% of the rows within that ID. Then do it again for 2% of each ID's rows, and 3% and so on up to 99% of the rows per ID. Also, do not select a random point to begin sampling that is closer to the end of the ID's rows than the percentage desired to be samples (i.e., don't try to sample 20% of sequential rows from a point that's 10% from the end of an ID's number of rows.)
What dfcombine
looks like from the first code chunk below, only instead of randomly selected fruit
rows within an id
, the fruit
rows will have only a random start-point, with the subsequent rows needed for the sample following the start-point row sequentially.
I can pull part of this problem off with the following code -- but it selects all rows at random, and I need the sample chunks to be sequential following the random start point (FYI: if you run this, you'll see your chunks start at 6% b/c this is a small dataset -- no rows <6% of sample-per-id):
library(tidyverse)
set.seed(123) # pick same sample each time
dflist<-list() # make an empty list
for (i in 1:100) # "do i a hundred times"
{
i.2<-i/100 # i.2 is i/100
dflooped <- df %>% # new df
group_by(id) %>% # group by id
sample_frac(i.2,replace=TRUE) # every i.2, take a random sample
dflooped
dflist[[i]]<-dflooped
}
dflist # check
library(data.table)
dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df
I can also pick the sequentially larger chunks I'm looking for with this -- but it doesn't allow me the random start (it always goes from the beginning of the df):
lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])
and using dplyr group_by
spits an error I don't understand:
df2 <- df %>%
group_by(id) %>%
lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])
Error in match.fun(FUN) :
'seq(0.01, 1, 0.01)' is not a function, character or symbol
So I may have some of the pieces, but am having trouble putting them together -- the solution may or may not include what I've done above. Thanks.
Upvotes: 2
Views: 256
Reputation: 906
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
I question the statistical impact of only starting your random sample from a spot where you won't "run out" of observations within this ID category.
Would it not be better to loop back to the top of the records within each ID category if you were to run out? That would ensure a uniform chance of beginning your sample within any portion of a specific ID field as opposed to limiting yourself to only within the first 80% of the data if we want a 20% sample size. Just a thought! I built this as you asked though!
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
#browser()
# subset the data frame for the ID we're currently interested in
p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
# calculate number of rows we need in order to sample _% of the data within this ID
nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
# calculate a single random number to serve as our start point somewhere between:
# 1 and the (number of rows - (number of rows to sample + 1)) -- the plus 1
# is to add a cushion and avoid issues
start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - (nrows_to_sample + 1) )))
# sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
return(p_df[all_samp_indx,])
}
Test out the function with just a single sample for a certain percent (10% here). This is also a good way to redo several of the same function call to ensure a randomized starting location.
# single test: give me 40% of the columns with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
Set aside a unique list of all potential values within the id
field. Also set aside a vector of sample sizes in percent format (between 0 and 1).
# capture all possible values in id field
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)
# initialize list:
combined_list <- list()
for(i in 1:length(possible_ids)) {
#browser()
print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}
# process results of for loop
combined_list
# number of rows in each df in our list
sapply(combined_list, nrow)
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
I'll leave what I initially wrote up there, but in retrospect, I think this is actually a bit closer to what you are asking for.
This solution uses the same type of function, but I used nested for loops to achieve what you were asking for.
For each ID, it will:
Code:
df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)
# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)
# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
# subset the data frame for the ID we're currently interested in
p_df <- p_df[ p_df[, p_idname] == p_idvalue, ]
# calculate number of rows we need in order to sample _% of the data within this ID
nrows_to_sample <- floor(p_sampleperc * nrow(p_df))
# don't let us use zero as an index
if(nrows_to_sample < 1) {
nrows_to_sample <- 1
}
# calculate a single random number to serve as our start point somewhere between:
# 1 and the (number of rows - (number of rows to sample + 1)) -- the plus 1
# is to add a cushion and avoid issues
start_samp_indx <- as.integer(runif(1, 1, (nrow(p_df) - nrows_to_sample )))
# sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
return(p_df[all_samp_indx,])
}
# single test: give me 40% of the columns with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)
# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
possible_ids <- unique(df$id)
# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:99 / 100)
# adding an expand grid
ids_sample <- expand.grid(possible_ids, sampleperc_sequence)
# initialize list:
combined_list <- list()
counter <- 1
for(i in 1:length(possible_ids)) {
for(j in 1:length(sampleperc_sequence)) {
print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])
# manually keep track of counter
counter <- counter + 1
}
}
random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])
# process results of for loop
combined_list
# check size of first list element
combined_list[[1]] # A, 10% sample is 1 record
# check thirtieth element
combined_list[[30]] # A, 30% sample is 3 records
# check size of the sixtieth list element
combined_list[60] # A, 60% sample is 6 records
sapply(combined_list, nrow) # number of rows in each df in our list
# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
Upvotes: 1