Reputation: 5897
I am working with the R programming language.
Suppose there is a classroom of students - each student flips the same coin many times (the students don't all flip the coin the same number of times). Here is a simulated dataset to represent this example:
library(tidyverse)
library(dplyr)
set.seed(123)
ids = 1:100
student_id = sample(ids, 1000, replace = TRUE)
coin_result = sample(c("H", "T"), 1000, replace = TRUE)
my_data = data.frame(student_id, coin_result)
my_data = my_data[order(my_data$student_id),]
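Here is some R code that counts every 3-flip sequence within each student and then computes the probability of the third flip given the first two: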
results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)
final = results %>%
  mutate(two_seq = substr(Sequence, 1, 2)) %>%
  group_by(two_seq) %>%
  mutate(third = substr(Sequence, 3, 3)) %>%
  group_by(two_seq, third) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums))
My Question: Suppose I now want to extend this problem to 4-flip sequences (e.g. the probability of H given HHH). I can extend the code manually:
results = my_data %>%
  group_by(student_id) %>%
  summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, 2), lead(coin_result, 3)), .groups = 'drop') %>%
  filter(!is.na(Sequence)) %>%
  count(Sequence)
final = results %>%
  mutate(three_seq = substr(Sequence, 1, 3)) %>%
  group_by(three_seq) %>%
  mutate(fourth = substr(Sequence, 4, 4)) %>%
  group_by(three_seq, fourth) %>%
  summarize(sums = sum(n)) %>%
  mutate(prob = sums / sum(sums))
Is it possible to convert the above code into a function so that I can repeat this for sequences of arbitrary length? For example:
results <- function(i) {return(my_data %>%
group_by(student_id) %>%
summarize(Sequence = str_c(coin_result, lead(coin_result), lead(coin_result, i+1), lead(coin_result, i+2) .....### insert code here ####), .groups = 'drop') %>%
filter(!is.na(Sequence)) %>%
count(Sequence))}
final <- function(i) {
return(results %>%
mutate(three_seq = substr(Sequence, 1, i)) %>%
group_by(three_seq) %>%
mutate(fourth = substr(Sequence, i+1, i+1)) %>%
group_by(three_seq, fourth) %>%
summarize(sums = sum(n)) %>%
mutate(prob = sums / sum(sums)))
}
I am not sure exactly how I would do this, since the first function would need to be "dynamically changed" depending on the value of "i".
Can someone please show me how to do this?
Thanks!
Upvotes: 0
Views: 69
Reputation: 1180
Here's a way you can do it in base R:
# Returns a vector of 0's and 1's, bit more efficient than sample
tosses <- floor(runif(1e3, 0, 2))
count_seqs <- function(x, seq_length) {
  vec_length <- length(x)
  rolling_window_indices <- rep(1:seq_length, vec_length - seq_length + 1) +
    rep(0:(vec_length - seq_length), each = seq_length)
  mat <- matrix(x[rolling_window_indices], nrow = seq_length)
  sequences <- apply(mat, 2, paste0, collapse = "")
  table(sequences)
}
count_seqs(tosses, 3)
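To get from these counts to the conditional probabilities asked about in the question (the probability of the last flip given the preceding ones), one option is to split each sequence into a prefix and a final symbol and normalise within each prefix. This is only a minimal sketch, assuming the counts come as a named vector or one-dimensional table of equal-length sequences; `cond_probs` and `example_counts` are hypothetical names used for illustration:

```r
# Hypothetical helper: turns sequence counts into P(last symbol | prefix).
# Assumes every name in `counts` has the same number of characters.
cond_probs <- function(counts) {
  seqs <- names(counts)
  k <- nchar(seqs[1])
  out <- data.frame(
    prefix = substr(seqs, 1, k - 1),   # first k - 1 symbols
    last   = substr(seqs, k, k),       # final symbol
    n      = as.vector(counts),
    stringsAsFactors = FALSE
  )
  # Divide each count by the total count of its prefix
  out$prob <- out$n / ave(out$n, out$prefix, FUN = sum)
  out
}

# Made-up counts, only to show the shape of the result
example_counts <- c("000" = 3, "001" = 1, "010" = 2, "011" = 2)
cond_probs(example_counts)
# prob column: 0.75, 0.25, 0.50, 0.50
```

Since a one-dimensional table keeps the sequences as its names, something like `cond_probs(count_seqs(tosses, 4))` should give the 4-flip version without any other change.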
Notice I didn't include any ids in the code above. The reason is that, if all students have the same probability of tossing heads or tails, we can treat them as independent (or, more precisely, treat the design as ignorable). However, it's easy to expand the code for situations where the tosses are not independent, e.g. where each participant has a different probability of tossing heads:
ids <- floor(runif(1e3, 1, 101))
probs <- runif(1e2, 0, 1)
tosses_by_id <- lapply(ids, function(i) rbinom(10, 1, probs[i]))
lapply(tosses_by_id, function(x) count_seqs(x, 3))
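If the data is in a data frame like the question's `my_data`, and windows should not run from one student's flips into the next student's (which is what the `group_by(student_id)` in the question achieves), the same idea can be applied per student and the counts pooled afterwards. A rough sketch, assuming columns named `student_id` and `coin_result`; `count_seqs_by_group` and `toy` are hypothetical names, not part of the code above:

```r
# Hypothetical helper: counts length-`seq_length` windows within each student
# and pools them into a single table.
count_seqs_by_group <- function(data, seq_length) {
  flips_by_student <- split(data$coin_result, data$student_id)
  windows <- unlist(lapply(flips_by_student, function(x) {
    n <- length(x)
    if (n < seq_length) return(character(0))  # too few flips for one window
    starts <- seq_len(n - seq_length + 1)
    vapply(starts,
           function(s) paste0(x[s:(s + seq_length - 1)], collapse = ""),
           character(1))
  }), use.names = FALSE)
  table(windows)
}

# Small made-up example: student 1 flips H H T H, student 2 flips T T H
toy <- data.frame(
  student_id  = c(1, 1, 1, 1, 2, 2, 2),
  coin_result = c("H", "H", "T", "H", "T", "T", "H")
)
count_seqs_by_group(toy, 3)
# HHT, HTH and TTH each appear once; no window spans both students
```

Changing `seq_length` is then the only thing needed to move from 3-flip to 4-flip sequences.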
Upvotes: 1