Reputation: 627
My dataset is a series of surveys. Each survey is divided up into several time periods and each time period has several observations. Each line in the dataset is a single observation. It looks something like this:
Survey Period Observation
1.1 1 A
1.1 1 A
1.1 1 B
1.1 2 A
1.1 2 B
1.2 1 A
1.2 2 B
1.2 3 C
1.2 4 D
This is a simplified version of my dataset, but it demonstrates the point (several periods for each survey, several observations for each period). What I want to do is make a dataframe consisting of all the observations from a single, randomly selected, period in each survey, so that in the resulting dataframe each survey only has a single period, but all of the associated observations. I'm completely stumped on this one and don't even know where to start.
Thanks for your help
Upvotes: 1
Views: 68
Reputation: 526
You can achieve what you need in a straightforward way using plain base R, with something like this:
out = d[0, ] # make an empty data frame with the same structure as d
for (survey in unique(d$Survey)) { # for each value of Survey
  # distinct periods observed for this value of Survey
  periods = unique(d$Period[d$Survey == survey])
  # randomly choose 1 of them; indexing avoids sample(n, 1) drawing from 1:n when only one period exists
  period = periods[sample.int(length(periods), 1)]
  # append all rows with that survey and that period to the empty df above
  out = rbind(out, d[d$Survey == survey & d$Period == period, ])
}
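If you would rather avoid the explicit loop, the same idea (pick one period per survey, then keep every matching row) can be sketched with split() and merge(). This is only a sketch: it assumes the data frame is called d and rebuilds it from the example in the question, and the names pairs, chosen and out are just illustrative.

```r
# Example data copied from the question; `d` is an assumed name
d <- read.table(text = "Survey Period Observation
1.1 1 A
1.1 1 A
1.1 1 B
1.1 2 A
1.1 2 B
1.2 1 A
1.2 2 B
1.2 3 C
1.2 4 D", header = TRUE)

set.seed(1)  # only so the random draw can be repeated

# One row per distinct (Survey, Period) combination
pairs <- unique(d[, c("Survey", "Period")])

# For each survey, keep one of its (Survey, Period) rows at random
chosen <- do.call(
  rbind,
  lapply(split(pairs, pairs$Survey),
         function(p) p[sample.int(nrow(p), 1), ])
)

# Keep every observation that belongs to a chosen combination
out <- merge(d, chosen, by = c("Survey", "Period"))
```

Sampling from the distinct combinations gives every period of a survey the same chance of being picked, regardless of how many observations it has.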
Upvotes: 1
Reputation: 16121
If I've understood correctly, for each survey you need to randomly select one period only and then get all corresponding observations.
There might be alternative ways, but I'm using a dplyr approach.
dt = read.table(text="Survey Period Observation
1.1 1 A
1.1 1 A
1.1 1 B
1.1 2 A
1.1 2 B
1.2 1 A
1.2 2 B
1.2 3 C
1.2 4 D", header=TRUE)
library(dplyr)
set.seed(49) ## just to be able to replicate the process exactly
dt %>%
  select(Survey, Period) %>%                 ## select relevant columns
  distinct() %>%                             ## keep unique combinations
  group_by(Survey) %>%                       ## for each survey
  sample_n(1) %>%                            ## sample only one period
  ungroup() %>%                              ## forget about the grouping
  inner_join(dt, by=c("Survey","Period"))    ## get corresponding observations
# Survey Period Observation
# (dbl) (int) (fctr)
# 1 1.1 1 A
# 2 1.1 1 A
# 3 1.1 1 B
# 4 1.2 2 B
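On dplyr 1.0.0 or later, sample_n() is superseded by slice_sample(), so the same pipeline could be written roughly as below. Treat it as a sketch rather than a drop-in replacement: picked and result are names I made up for illustration, semi_join() is swapped in for inner_join() to filter dt directly, and the rows you get for a given seed may not match the output shown above.

```r
library(dplyr)

# Same example data as above
dt <- read.table(text = "Survey Period Observation
1.1 1 A
1.1 1 A
1.1 1 B
1.1 2 A
1.1 2 B
1.2 1 A
1.2 2 B
1.2 3 C
1.2 4 D", header = TRUE)

set.seed(49)  # just to make the draw repeatable

# One randomly chosen period per survey
picked <- dt %>%
  distinct(Survey, Period) %>%   # unique (Survey, Period) combinations
  group_by(Survey) %>%           # for each survey
  slice_sample(n = 1) %>%        # keep one combination at random
  ungroup()

# Keep the rows of dt whose (Survey, Period) was picked
result <- semi_join(dt, picked, by = c("Survey", "Period"))
```

semi_join() only filters dt, so the original columns, row order and duplicate observations are preserved.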
Upvotes: 2