Reputation: 77
First, I'm trying to subsample a large dataset with many individuals, but each individual requires a different subsample size. I'm comparing across two time periods, so I want to subsample each individual by the minimum data points each has across the two periods. Second, I have multiple metrics (mostly various means) to calculate per individual, per time period (I've provided one example below). Third, I want to bootstrap 1,000 reps for those metrics. I also want to do this for the population (by averaging across individuals). I have an example of what I've tried below, but that may be way off. I'm open to functions or for loops - I can't conceptualize which is better for this question. (I apologize ahead of time if my code is not efficient - I'm self taught from googling.)
# Example dataset
Data <- data.frame(
ID = sample(c("A", "B", "C", "D"), 50, replace = TRUE),
Act = sample(c("eat", "sleep", "play"), 50, replace = TRUE),
Period = sample(c("pre", "post"), 50, replace = TRUE)
)
# Separate my data by period
DataPre <- as.data.frame(Data[ which(Data $Period == "pre"), ])
DataPost <- as.data.frame(Data[ which(Data $Period == "post"), ])
# Get the minimum # observations for each ID across both periods
Num <- Data %>%
group_by(ID, Period) %>%
summarise(number=n()) %>%
group_by(ID) %>%
summarise(min=min(number))
# Function to get the mean proportion per ID
meanAct <- function(x){
x %>%
group_by(ID, Act) %>%
summarise (n = n()) %>%
mutate(freq = n / sum(n))
}
Below is how I would subsample if just ONE ID (not many different with different subsampling requirements). I don't know how to specify to subsample different amounts by ID and then replicate each.
# See "8888" Here I want to subsample the Num$Min for each ID
DataResults <- function(x, rep){
reps <- replicate(rep, meanAct(x[sample(1:nrow(x), 8888, replace=FALSE),]))
meanfreq <- apply(simplify2array(reps[3, 1:2]), 1, mean)
sd <- apply(simplify2array(reps[3, 1:2]), 1, sd)
lower <- meanfreq - 1.96*(sd/sqrt(8888))
upper <- meanfreq + 1.96*(sd/sqrt(8888))
meanAct <- as.vector(reps[[1]])
output <- data.frame(meanAct, meanfreq, sd, lower, upper)
print(output)
}
# Print results
DataResults(DataPre, 1000)
DataResults(DataPost, 1000)
# Somehow I get the mean for the population by averaging across all IDs
DataMeanGroup <- DataMean %>%
group_by(Period) %>%
summarise (mean = mean(prop))
The results I'm looking for are the means for each activity for each individual based on subsampling (by minimum datapoints PER INDIVIDUAL) and bootstrapping 1000 reps. Also, if possible, the overall mean for the population by averaging across individuals (again from subsampling and bootstrapping).
EDIT: Additional information: The ultimate result should allow me to compare the proportion of time that each ID does each activity across the two time periods (e.g. compare % time that A spends eating in the pre vs post, etc). But, subsampled for the period with too much data so that we're comparing an equal number of observations. The way the code would run in my head would be (1) subsample the observations so that we’re comparing an equal number of observations for each ID across the two periods, (2) calculate the proportion of each activity for each ID in each time period, (3) repeat that subsample calculation 1,000 times so that the proportion we end up with is representative of the total observations.
Upvotes: 1
Views: 826
Reputation: 107652
Consider generalizing your subsampling function to receive subsets of data frame passed in by
which can slice data frame by each unique pairing of ID and Period. But first calculate MinNum
by each ID and Period using ave
(inline aggregation). All code below uses base R (i.e., no other package):
Data and Functions
# Example dataset (WITH MORE ROWS)
set.seed(11919)
Data <- data.frame(
ID = sample(c("A", "B", "C", "D"), 500, replace = TRUE),
Act = sample(c("eat", "sleep", "play"), 500, replace = TRUE),
Period = sample(c("pre", "post"), 500, replace = TRUE)
)
# MIN NUM PER ID AND PERIOD GROUPING (NESTED ave FOR COUNT AND MIN AGGREGATIONS)
Data$Min_Num <- with(Data, ave(ave(1:nrow(Data), ID, Period, FUN=length), ID, FUN=min))
# Function to get the mean proportion per ID
meanAct <- function(x){
within(x, {
n <- ave(1:nrow(x), ID, Act, FUN=length)
freq <- n / sum(n)
})
}
DataResults <- function(df, rep){
reps <- replicate(rep, meanAct(df[sample(1:nrow(df), df$Min_Num[1], replace=FALSE),]))
mean_freq <- apply(simplify2array(reps["freq", ]), 1, mean) # ADJUSTED [] INDEXING
sd <- apply(simplify2array(reps["freq", ]), 1, sd) # ADJUSTED [] INDEXING
lower <- mean_freq - 1.96*(sd/sqrt(df$Min_Num[1]))
upper <- mean_freq + 1.96*(sd/sqrt(df$Min_Num[1]))
mean_act <- as.vector(reps[[2]]) # ADJUSTED [[#]] NUMBER
id <- df$ID[1] # ADD GROUP INDICATOR
period <- df$Period[1] # ADD GROUP INDICATOR
output <- data.frame(id, period, mean_act, mean_freq, sd, lower, upper)
return(output)
}
Processing
# BY CALL
df_list <- by(Data, Data[c("ID", "Period")], function(sub) DataResults(sub, 1000))
# BIND ALL DFs INTO ONE DF
final_df <- do.call(rbind, df_list)
head(final_df, 10)
# id period mean_act mean_freq sd lower upper
# 1 A post sleep 0.02157354 0.005704140 0.01992512 0.02322196
# 2 A post eat 0.02151701 0.005720058 0.01986399 0.02317003
# 3 A post sleep 0.02171393 0.005808156 0.02003546 0.02339241
# 4 A post eat 0.02164184 0.005716603 0.01998982 0.02329386
# 5 A post play 0.02174095 0.005678416 0.02009996 0.02338193
# 6 A post eat 0.02181380 0.005716590 0.02016178 0.02346581
# 7 A post sleep 0.02172458 0.005691051 0.02007995 0.02336922
# 8 A post sleep 0.02174288 0.005666839 0.02010524 0.02338052
# 9 A post play 0.02166234 0.005673047 0.02002291 0.02330177
# 10 A post play 0.02185057 0.005813680 0.02017050 0.02353065
Summarization
# SUMMARIZE FINAL DF (MEAN PROP BY ID AND ACT)
agg_df <- aggregate(mean_freq ~ id + mean_act, final_df, mean)
agg_df
# id mean_act mean_freq
# 1 A eat 0.02172782
# 2 B eat 0.01469706
# 3 C eat 0.01814771
# 4 D eat 0.01696995
# 5 A play 0.02178283
# 6 B play 0.01471497
# 7 C play 0.01819898
# 8 D play 0.01688828
# 9 A sleep 0.02169912
# 10 B sleep 0.01470978
# 11 C sleep 0.01818944
# 12 D sleep 0.01697438
# SUMMARIZE FINAL DF (MEAN PROP BY ID AND PERIOD)
agg_df <- aggregate(mean_freq ~ id + period, final_df, mean)
agg_df
# id period mean_freq
# 1 A post 0.02173913
# 2 B post 0.01470588
# 3 C post 0.01818182
# 4 D post 0.01694915
# 5 A pre 0.02173913
# 6 B pre 0.01470588
# 7 C pre 0.01818182
# 8 D pre 0.01694915
# SUMMARIZE FINAL DF (MEAN PROP BY ID)
agg_df <- aggregate(mean_freq ~ id, final_df, mean)
agg_df
# id mean_freq
# 1 A 0.02173913
# 2 B 0.01470588
# 3 C 0.01818182
# 4 D 0.01694915
Upvotes: 2