C. Denney

Reputation: 627

Selecting random portions of a dataframe

My dataset is a series of surveys. Each survey is divided up into several time periods and each time period has several observations. Each line in the dataset is a single observation. It looks something like this:

Survey     Period     Observation
  1.1        1            A
  1.1        1            A
  1.1        1            B
  1.1        2            A
  1.1        2            B
  1.2        1            A
  1.2        2            B
  1.2        3            C
  1.2        4            D

This is a simplified version of my dataset, but it demonstrates the point (several periods for each survey, several observations for each period). What I want to do is make a dataframe consisting of all the observations from a single, randomly selected, period in each survey, so that in the resulting dataframe each survey only has a single period, but all of the associated observations. I'm completely stumped on this one and don't even know where to start.

Thanks for your help

Upvotes: 1

Views: 68

Answers (2)

jtatria

Reputation: 526

You can achieve what you need in a straightforward way using plain base R, with something like this:

out = d[0, ] # make empty dataframe with the same structure as d (your data)
for (survey in unique(d$Survey)) { # for each value of Survey
  # distinct values of Period observed for this value of Survey:
  periods = unique(d$Period[d$Survey == survey])
  # randomly choose 1 of them; indexing with sample.int() is used because
  # sample(x, 1) would draw from 1:x when x is a single number
  period = periods[sample.int(length(periods), 1)]
  # attach all rows with that survey and that period to the empty df above
  out = rbind(out, d[d$Survey == survey & d$Period == period, ])
}
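
For comparison, the same idea can probably be written without an explicit loop. This is only a sketch under assumptions: it rebuilds the example data from the question as `d` (the column types are my guess), and `pairs` and `picked` are just illustrative names.

```r
# Example data mirroring the question (column types are assumed)
d <- data.frame(
  Survey = c(1.1, 1.1, 1.1, 1.1, 1.1, 1.2, 1.2, 1.2, 1.2),
  Period = c(1, 1, 1, 2, 2, 1, 2, 3, 4),
  Observation = c("A", "A", "B", "A", "B", "A", "B", "C", "D")
)

set.seed(1)  # only so the draw can be repeated

# One row per distinct Survey/Period combination
pairs <- unique(d[, c("Survey", "Period")])

# For each survey, pick the position of one of its combinations at random
picked <- vapply(
  split(seq_len(nrow(pairs)), pairs$Survey),
  function(idx) idx[sample.int(length(idx), 1)],
  integer(1)
)

# Keep every observation belonging to the chosen combinations
out <- merge(d, pairs[picked, ], by = c("Survey", "Period"))
```

Sampling from the distinct combinations gives each period the same chance of being chosen, regardless of how many observations it has.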

Upvotes: 1

AntoniosK

Reputation: 16121

If I've understood correctly, for each survey you need to randomly select only one period and then get all the corresponding observations. There might be alternative ways, but I'm using a dplyr approach.

dt = read.table(text="Survey     Period     Observation
                1.1        1            A
                1.1        1            A
                1.1        1            B
                1.1        2            A
                1.1        2            B
                1.2        1            A
                1.2        2            B
                1.2        3            C
                1.2        4            D", header=TRUE)

library(dplyr)

set.seed(49)  ## just to be able to replicate the process exactly

dt %>%
  select(Survey, Period) %>%               ## select relevant columns
  distinct() %>%                           ## keep unique combinations
  group_by(Survey) %>%                     ## for each survey
  sample_n(1) %>%                          ## sample only one period
  ungroup() %>%                            ## forget about the grouping
  inner_join(dt, by=c("Survey","Period"))  ## get corresponding observations

#    Survey Period Observation
#     (dbl)  (int)      (fctr)
# 1    1.1      1           A
# 2    1.1      1           A
# 3    1.1      1           B
# 4    1.2      2           B
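
In dplyr 1.0.0 and later, `sample_n()` is superseded by `slice_sample()`, so the same pipeline could be sketched as below. The data frame is rebuilt here only to keep the snippet self-contained (its column types are an assumption), and `result` is just an illustrative name. I have not listed the output, since the rows drawn depend on the seed and the dplyr version.

```r
library(dplyr)

# Example data mirroring the question (column types are assumed)
dt <- data.frame(
  Survey = c(1.1, 1.1, 1.1, 1.1, 1.1, 1.2, 1.2, 1.2, 1.2),
  Period = c(1, 1, 1, 2, 2, 1, 2, 3, 4),
  Observation = c("A", "A", "B", "A", "B", "A", "B", "C", "D")
)

set.seed(49)  ## just to be able to replicate the process

result <- dt %>%
  distinct(Survey, Period) %>%                ## unique survey/period combinations
  group_by(Survey) %>%                        ## for each survey
  slice_sample(n = 1) %>%                     ## sample only one period
  ungroup() %>%                               ## forget about the grouping
  inner_join(dt, by = c("Survey", "Period"))  ## get corresponding observations
```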

Upvotes: 2
