Reputation: 330
I need to do conditional random sampling, but I'm not sure how to achieve it, so any help would be MUCH appreciated :) Let's assume my data frame is the following:
df <- data.frame(
  newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE),
  event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE),
  article = sample(0:1, 90, replace = TRUE)
)
df <- subset(df, article > 0)
[article = 1 means that there is an article. In the real dataset, this column would hold the title of the actual article.]
Basically, I need to pick two random articles whenever there are more than two for a given combination of newspaper + event, and keep all articles otherwise.
I'm quite unsure how to build the loop to get this... any ideas?
Thanks!
Fred
Upvotes: 0
Views: 397
Reputation: 389235
We can group_by newspaper and event; if a group has more than 2 rows, select 2 of them at random, otherwise select all of its rows.
library(dplyr)
df %>%
  group_by(newspaper, event) %>%
  slice(if (n() > 2) sample(1:n(), 2) else 1:n())
# newspaper event article
# <fct> <fct> <int>
# 1 Newspaper 1 Event 1 1
# 2 Newspaper 1 Event 1 1
# 3 Newspaper 1 Event 2 1
# 4 Newspaper 1 Event 2 1
# 5 Newspaper 1 Event 3 1
# 6 Newspaper 1 Event 3 1
# 7 Newspaper 1 Event 4 1
# 8 Newspaper 1 Event 4 1
# 9 Newspaper 2 Event 1 1
#10 Newspaper 2 Event 2 1
# … with 24 more rows
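As a quick sanity check on the approach above, here is a minimal sketch that compares group sizes before and after sampling. The fixed seed and the names sampled, sizes, n_before and n_after are assumptions added for illustration; they are not part of the original question or answer.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(
  newspaper = sample(paste("Newspaper", 1:4), 90, replace = TRUE),
  event     = sample(paste("Event", 1:5), 90, replace = TRUE),
  article   = sample(0:1, 90, replace = TRUE)
)
df <- subset(df, article > 0)

# Same sampling rule as above: 2 random rows if the group has more than 2
sampled <- df %>%
  group_by(newspaper, event) %>%
  slice(if (n() > 2) sample(1:n(), 2) else 1:n()) %>%
  ungroup()

# Group sizes before and after sampling, side by side
sizes <- left_join(
  count(df, newspaper, event, name = "n_before"),
  count(sampled, newspaper, event, name = "n_after"),
  by = c("newspaper", "event")
)

# Every group should keep min(original size, 2) rows
stopifnot(all(sizes$n_after == pmin(sizes$n_before, 2)))
```

If the stopifnot() call passes silently, every newspaper + event combination kept at most two rows and small groups were left intact.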
Or we can avoid the if condition by using pmin, which sets the sample size to the smaller of 2 and the number of rows in the group.
df %>%
  group_by(newspaper, event) %>%
  slice(sample(1:n(), pmin(2, n())))
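As a possible alternative, and assuming dplyr 1.0.0 or later is installed, slice_sample() should give the same result with less code: per its documentation, when n is larger than a group, the result is silently truncated to the group size. The sketch below is not part of the original answer; the seed and the names by_size, before and after are illustrative.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
df <- data.frame(
  newspaper = sample(paste("Newspaper", 1:4), 90, replace = TRUE),
  event     = sample(paste("Event", 1:5), 90, replace = TRUE),
  article   = sample(0:1, 90, replace = TRUE)
)
df <- subset(df, article > 0)

# Up to 2 random rows per newspaper + event combination;
# groups with fewer than 2 rows are returned whole
by_size <- df %>%
  group_by(newspaper, event) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Compare group sizes before and after (count() sorts both the same way)
before <- count(df, newspaper, event)
after  <- count(by_size, newspaper, event)
stopifnot(all(after$n == pmin(before$n, 2)))
```

Like the slice() versions, this samples without replacement, so no article is picked twice within a group.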
Upvotes: 1