Conditional random sampling

Question

I need to do conditional random sampling, but I'm not sure how to achieve this, so any help would be much appreciated :) Let's assume my data frame is the following:

df <- data.frame(
  newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE),
  event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE),
  article = sample(c(0:1), 90, replace = TRUE)
)
df <- subset(df, article > 0)

[article = 1 means that there is an article. In the real dataset, this column would hold the title of the actual article.]

I basically need to pick two random articles whenever there are more than two for a given combination of newspaper + event, and keep all articles otherwise. I'm quite unsure how to build the loop to get this. Any ideas? Thanks! Fred

Ronak Shah · Accepted Answer

We can group by newspaper and event, then select 2 random rows if a group has more than 2 rows, or keep all of its rows otherwise.

library(dplyr)

df %>%
  group_by(newspaper, event) %>%
  slice(if(n() > 2) sample(1:n(), 2) else 1:n())

#   newspaper   event   article
# 1 Newspaper 1 Event 1       1
# 2 Newspaper 1 Event 1       1
# 3 Newspaper 1 Event 2       1
# 4 Newspaper 1 Event 2       1
# 5 Newspaper 1 Event 3       1
# 6 Newspaper 1 Event 3       1
# 7 Newspaper 1 Event 4       1
# 8 Newspaper 1 Event 4       1
# 9 Newspaper 2 Event 1       1
#10 Newspaper 2 Event 2       1
# … with 24 more rows
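
To confirm that the sampling did what was asked, the group sizes before and after can be compared. This is only a minimal sketch, assuming dplyr is installed; the seed and the names `sampled` and `check` are illustrative and not part of the question.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only to make the sketch reproducible
df <- data.frame(
  newspaper = sample(paste("Newspaper", 1:4), 90, replace = TRUE),
  event     = sample(paste("Event", 1:5), 90, replace = TRUE),
  article   = sample(0:1, 90, replace = TRUE)
)
df <- subset(df, article > 0)

# Same approach as above: 2 random rows for large groups, all rows otherwise
sampled <- df %>%
  group_by(newspaper, event) %>%
  slice(if (n() > 2) sample(1:n(), 2) else 1:n()) %>%
  ungroup()

# Group sizes before and after sampling, side by side
check <- count(df, newspaper, event, name = "n_before") %>%
  left_join(count(sampled, newspaper, event, name = "n_after"),
            by = c("newspaper", "event"))

# Every group should keep min(n_before, 2) rows
stopifnot(all(check$n_after == pmin(check$n_before, 2)))
```

If the final `stopifnot()` passes silently, groups with more than 2 articles were cut down to 2 and smaller groups were left untouched.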

Or we can avoid the if condition by using pmin, which returns the smaller of 2 and the number of rows in the group, so we never try to sample more rows than the group has.

df %>%
  group_by(newspaper, event) %>%
  slice(sample(1:n(), pmin(2, n())))
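
As a possible alternative, assuming dplyr 1.0.0 or later is available, `slice_sample()` does the capping itself: when sampling without replacement, an `n` larger than the group size is silently truncated to the group size. The `toy` data frame below is made up for illustration only.

```r
library(dplyr)

set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical toy data: one group with 3 articles, one group with 1
toy <- data.frame(
  newspaper = c("Newspaper 1", "Newspaper 1", "Newspaper 1", "Newspaper 2"),
  event     = c("Event 1", "Event 1", "Event 1", "Event 2"),
  article   = 1L
)

result <- toy %>%
  group_by(newspaper, event) %>%
  slice_sample(n = 2) %>%  # at most 2 rows per group
  ungroup()

# 2 rows from the first group plus the single row of the second
nrow(result)
```

This keeps the pipeline free of `n()` arithmetic, at the cost of requiring a newer dplyr version.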
