decaper

Reputation: 55

Sampling progressively larger chunks of sequential rows with random starts per ID

Example Data

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE))

Problem

Pick a random start point within each id and, from that point, select that row plus the subsequent sequential rows, totaling 1% of the rows within that id. Then do it again for 2% of each id's rows, then 3%, and so on up to 99% of the rows per id. Also, the random start point must not be closer to the end of the id's rows than the percentage being sampled (i.e., don't try to sample 20% of sequential rows from a point that's 10% from the end of an id's rows).

Desired Result

The same shape as dfcombine from the first code chunk below, except that instead of randomly selected fruit rows within an id, only the start point is random; the remaining rows needed for each sample follow the start-point row sequentially.

What I've Tried

I can pull part of this problem off with the following code -- but it selects all rows at random, and I need the sample chunks to be sequential following the random start point (FYI: if you run this, you'll see your chunks start at 6% b/c this is a small dataset -- no rows <6% of sample-per-id):

library(tidyverse)

set.seed(123) # pick same sample each time

dflist<-list() # make an empty list

for (i in 1:100) { # "do i a hundred times"
  i.2 <- i/100 # i.2 is the fraction to sample
  dflooped <- df %>% # new df
    group_by(id) %>% # group by id
    sample_frac(i.2, replace=TRUE) # randomly sample fraction i.2 of each id's rows
  dflist[[i]] <- dflooped
}
dflist # check

library(data.table)

dfcombine <- rbindlist(dflist, idcol = "id%") # put the list elements in a df

I can also pick the sequentially larger chunks I'm looking for with this -- but it doesn't allow me the random start (it always goes from the beginning of the df):

lapply(seq(.01,.1,.01), function(i) df[1:(nrow(df)*i),])

and using dplyr group_by spits an error I don't understand:

df2 <- df %>%
  group_by(id) %>%
  lapply(seq(.01,1,.01), function(i) df[1:(nrow(df)*i),])

Error in match.fun(FUN) : 
  'seq(0.01, 1, 0.01)' is not a function, character or symbol

So I may have some of the pieces, but am having trouble putting them together -- the solution may or may not include what I've done above. Thanks.

Upvotes: 2

Views: 256

Answers (1)

TaylorV

Reputation: 906

Sequential sampling within ID

Create fake data

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)

Add a numeric column with (nearly) unique values to the test data, so it is easy to see which rows were sampled:

df$random_numb <- round(runif(nrow(df), 1, 100), 2)

A note first: I question the statistical impact of only starting your random sample from a spot where you won't "run out" of observations within an ID.

Would it not be better to wrap around to the first record within each ID if you run out? That would give every row in the ID the same chance of being the start point, as opposed to limiting the start to the first 80% of the data when you want a 20% sample. Just a thought! I built this as you asked, though. Here's the function:

random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {

    #browser()

    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[  p_df[, p_idname] == p_idvalue,  ]


    # calculate number of rows we need in order to sample _% of the data within this ID
    nrows_to_sample <- floor(p_sampleperc * nrow(p_df))


    # calculate a single random number to serve as our start point somewhere between:
        # 1 and the (number of rows - (number of rows to sample + 1))  --  the plus 1 
        # is to add a cushion and avoid issues
    start_samp_indx <- as.integer(runif(1,  1, (nrow(p_df) - (nrows_to_sample + 1)  )))


    # sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
    all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
    return(p_df[all_samp_indx,])
}
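The wrap-around idea raised above could be sketched roughly as follows. This is only an illustration, not what the question asked for: wraparound_seq_sample and demo_df are hypothetical names, and the sketch assumes the rows within each ID are already in the order you want to sample from.

```r
# Hypothetical wrap-around variant (not part of the original answer):
# any row can be the start; running past the last row continues from row 1.
wraparound_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {
    # subset the data frame for the ID we're currently interested in
    sub_df <- p_df[p_df[[p_idname]] == p_idvalue, , drop = FALSE]
    n <- nrow(sub_df)

    # number of rows to take, with a minimum of 1
    k <- max(1, floor(p_sampleperc * n))

    # every row is an equally likely start point
    start <- sample.int(n, 1)

    # zero-based offsets from the start, wrapped modulo n, then back to one-based
    idx <- ((start - 1 + seq_len(k) - 1) %% n) + 1
    sub_df[idx, , drop = FALSE]
}

set.seed(42)
demo_df <- data.frame(id = rep(c("A", "B"), each = 10), pos = rep(1:10, 2))
wraparound_seq_sample(demo_df, "id", "A", 0.4)
```

The pos column is only there so you can see whether a given sample wrapped from row 10 back to row 1.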

Test function for a single function call

Test the function with a single call for one percentage (10% here). Re-running the same call several times is also a good way to confirm that the starting location really is randomized.

# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)

Now place function in for loop

Set aside a unique list of all potential values within the id field. Also set aside a vector of sample sizes in percent format (between 0 and 1).

# capture all possible values in id field
possible_ids <- unique(df$id)

# these values need to be between 0 and 1 (10% == 0.1)
sampleperc_sequence <- (1:length(possible_ids) / 10)  


# initialize list:
combined_list <- list()


for(i in 1:length(possible_ids)) {
    #browser()

    print(paste0("Now sampling ", sampleperc_sequence[i], " from ", possible_ids[i]))
    combined_list[[i]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[i])
}

Process the results

# process results of for loop
combined_list

# number of rows in each df in our list
sapply(combined_list, nrow)  

This is the resulting dataset of all combinations of samples

# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
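To confirm that each sample really is a run of consecutive rows, one option is to look at the row names each subset keeps from the original data frame. A minimal sketch, assuming the data frame has default integer row names (as df does here); is_sequential and demo_df are hypothetical names, not part of the code above.

```r
# Hypothetical check (not from the original answer): TRUE when the rows of a
# sample were adjacent in the source data frame, judged by their row names.
is_sequential <- function(sample_df) {
    pos <- as.integer(rownames(sample_df))
    all(diff(pos) == 1)
}

demo_df <- data.frame(id = rep(c("A", "B"), each = 5), value = 1:10)

contiguous <- demo_df[7:9, ]        # rows 7, 8, 9
scattered  <- demo_df[c(2, 4, 5), ] # rows 2, 4, 5

is_sequential(contiguous)  # TRUE
is_sequential(scattered)   # FALSE
```

Applied to the loop output, this would be sapply(combined_list, is_sequential). Run it on the list rather than on dfcombined, because rbind may rewrite row names that repeat across samples.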

EDIT:

I'll leave what I initially wrote up there, but in retrospect, I think this is actually a bit closer to what you are asking for.

This solution uses the same type of function, but I used nested for loops to achieve what you were asking for.

For each ID, it will:

  • subset dataframe for this ID value
  • find random starting point
  • sample n% of the data (starting with 1%)
  • repeat with +1% to n (up to 99%)

Code:

df <- data.frame(id=rep(LETTERS, each=10)[1:50], fruit=sample(c("apple", "orange", "banana"), 50, TRUE), stringsAsFactors = F)

# adding a more unique data element to test data for testing sampling
df$random_numb <- round(runif(nrow(df), 1, 100), 2)





# function to do what you want:
random_start_seq_sample <- function(p_df, p_idname, p_idvalue, p_sampleperc) {


    # subset the data frame for the ID we're currently interested in
    p_df <- p_df[  p_df[, p_idname] == p_idvalue,  ]


    # calculate number of rows we need in order to sample _% of the data within this ID
    nrows_to_sample <- floor(p_sampleperc * nrow(p_df))


    # don't let us use zero as an index
    if(nrows_to_sample < 1) {
        nrows_to_sample <- 1
    }


    # pick a single random start point between 1 and
        # (number of rows - number of rows to sample + 1), so every start that
        # still leaves room for the full sample is equally likely
    start_samp_indx <- sample.int(nrow(p_df) - nrows_to_sample + 1, 1)


    # sample our newly subset dataframe for what we need (nrows to sample minus 1) and return
    all_samp_indx <- start_samp_indx:(start_samp_indx + (nrows_to_sample - 1))
    return(p_df[all_samp_indx,])
}





# single test: give me 10% of the rows with 'A' in the 'id' field:
random_start_seq_sample(df, 'id', 'A', 0.1)





# now put this bad boy in a for loop -- put these in order of what IDs match what sequence
    possible_ids <- unique(df$id)

    # these values need to be between 0 and 1 (10% == 0.1)
    sampleperc_sequence <- (1:99 / 100)  

    # all id/percent combinations (not used by the loop below, but handy for inspection)
    ids_sample <- expand.grid(possible_ids, sampleperc_sequence)



# initialize list:
combined_list <- list()
counter <- 1

for(i in 1:length(possible_ids)) {
    for(j in 1:length(sampleperc_sequence)) {
        print(paste0("Now sampling ", (sampleperc_sequence[j] * 100), "% from ", possible_ids[i]))
        combined_list[[counter]] <- random_start_seq_sample(df, 'id', possible_ids[i], sampleperc_sequence[j])

        # manually keep track of counter
        counter <- counter + 1
    }


}


random_start_seq_sample(df, 'id', possible_ids[1], sampleperc_sequence[91])


# process results of for loop
combined_list

    # check size of first list element
    combined_list[[1]]  # A, 1% sample rounds down to 0 rows, so we get the 1-record minimum


    # check thirtieth element
    combined_list[[30]] # A, 30% sample is 3 records


    # check size of the sixtieth list element
    combined_list[[60]] # A, 60% sample is 6 records





sapply(combined_list, nrow)  # number of rows in each df in our list


# cross reference the numeric field with the original data frame to make sure we had random starting points
dfcombined <- do.call(rbind, combined_list)
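The combined data frame above does not record which percentage each row came from, which the dfcombine in the question does through its "id%" column. One way to add that label is sketched below, under assumptions: seq_chunk, demo_df, grid, and sample_perc are hypothetical names, and the sampler is a compact stand-in for the function above (random start, consecutive rows, minimum of 1 row), not a copy of it.

```r
# Hypothetical compact sampler: k consecutive rows from a random valid start
seq_chunk <- function(sub_df, perc) {
    n <- nrow(sub_df)
    k <- max(1, floor(perc * n))       # at least one row
    start <- sample.int(n - k + 1, 1)  # last valid start is n - k + 1
    sub_df[start:(start + k - 1), , drop = FALSE]
}

set.seed(123)
demo_df <- data.frame(id = rep(c("A", "B"), each = 10),
                      fruit = sample(c("apple", "orange", "banana"), 20, TRUE),
                      stringsAsFactors = FALSE)

# one row per (id, whole-number percent) pair
grid <- expand.grid(id = unique(demo_df$id), pct = 1:99,
                    stringsAsFactors = FALSE)

labeled_list <- lapply(seq_len(nrow(grid)), function(r) {
    sub_df <- demo_df[demo_df$id == grid$id[r], , drop = FALSE]
    chunk <- seq_chunk(sub_df, grid$pct[r] / 100)
    chunk$sample_perc <- grid$pct[r]   # label each row with its percentage
    chunk
})

dfcombined_labeled <- do.call(rbind, labeled_list)
head(dfcombined_labeled)
```

Driving the loop from a single grid replaces the nested for loops and the manual counter; the sampling is the same either way.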

Upvotes: 1
