Reputation: 4309
Following up this question, I wondered how I can effectively sample a stratified Person Period file.
I have a database who looks like this
id time var clust
1: 1 1 a clust1
2: 1 2 c clust1
3: 1 3 c clust1
4: 2 1 a clust1
5: 2 2 a clust1
...
With individuals id
grouped into clusters clust
. What I would like is to sample id
by clust
, keeping the person period format.
The solution I came up with is to sample id
and then to merge
back. However, is it not a very elegant solution.
library(data.table)
library(dplyr)
setDT(dt)
dt[,.SD[sample(.N,1)],by = clust] %>%
merge(., dt, by = 'id')
which gives
id clust.x time.x var.x time.y var.y clust.y
1: 2 clust1 1 a 1 a clust1
2: 2 clust1 1 a 2 a clust1
3: 2 clust1 1 a 3 c clust1
4: 3 clust2 3 c 1 a clust2
5: 3 clust2 3 c 2 b clust2
6: 3 clust2 3 c 3 c clust2
7: 5 clust3 1 a 1 a clust3
8: 5 clust3 1 a 2 a clust3
9: 5 clust3 1 a 3 c clust3
Is there a more straightforward solution ?
library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2",
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L,
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L,
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names = c("id",
"time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))
Upvotes: 1
Views: 181
Reputation: 215057
Here is a variant following @Frank's comment that might help, essentially you can sample a unique id from each clust
group and find out the corresponding index number with .I
for subsetting:
dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]
# id time var clust
#1: 2 1 a clust1
#2: 2 2 a clust1
#3: 2 3 c clust1
#4: 3 1 a clust2
#5: 3 2 b clust2
#6: 3 3 c clust2
#7: 4 1 a clust3
#8: 4 2 b clust3
#9: 4 3 c clust3
Upvotes: 3
Reputation: 66819
I think tidy data here would have an ID table where cluster is an attribute:
idDT = unique(dt[, .(id, clust)])
id clust
1: 1 clust1
2: 2 clust1
3: 3 clust2
4: 4 clust3
5: 5 clust3
6: 6 clust2
From there, sample...
my_selection = idDT[, .(id = sample(id, 1)), by=clust]
and merge or subset
dt[ my_selection, on=names(my_selection) ]
# or
dt[ id %in% my_selection$id ]
I would keep the intermediate table my_selection
around, expecting it to come in handy later.
Upvotes: 2