giac
giac

Reputation: 4309

R - stratified sampling for Person Period file

Following up this question, I wondered how I can effectively sample a stratified Person Period file.

I have a database who looks like this

    id time var  clust
 1:  1    1   a clust1
 2:  1    2   c clust1
 3:  1    3   c clust1
 4:  2    1   a clust1
 5:  2    2   a clust1
...

With individuals id grouped into clusters clust. What I would like is to sample id by clust, keeping the person period format.

The solution I came up with is to sample id and then to merge back. However, is it not a very elegant solution.

library(data.table) 
library(dplyr) 

setDT(dt) 

dt[,.SD[sample(.N,1)],by = clust] %>% 
  merge(., dt, by = 'id')

which gives

   id clust.x time.x var.x time.y var.y clust.y
1:  2  clust1      1     a      1     a  clust1
2:  2  clust1      1     a      2     a  clust1
3:  2  clust1      1     a      3     c  clust1
4:  3  clust2      3     c      1     a  clust2
5:  3  clust2      3     c      2     b  clust2
6:  3  clust2      3     c      3     c  clust2
7:  5  clust3      1     a      1     a  clust3
8:  5  clust3      1     a      2     a  clust3
9:  5  clust3      1     a      3     c  clust3

Is there a more straightforward solution ?

library(data.table)
dt = setDT(structure(list(id = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("1", "2", 
"3", "4", "5", "6"), class = "factor"), time = structure(c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
 3L), .Label = c("1", "2", "3"), class = "factor"), var = structure(c(1L, 
3L, 3L, 1L, 1L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 2L, 
3L), .Label = c("a", "b", "c"), class = "factor"), clust = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 
2L), .Label = c("clust1", "clust2", "clust3"), class = "factor")), .Names =  c("id", 
 "time", "var", "clust"), row.names = c(NA, -18L), class = "data.frame"))

Upvotes: 1

Views: 181

Answers (2)

akuiper
akuiper

Reputation: 215057

Here is a variant following @Frank's comment that might help, essentially you can sample a unique id from each clust group and find out the corresponding index number with .I for subsetting:

dt[dt[, .I[id == sample(unique(id),1)], clust]$V1]

#   id time var  clust
#1:  2    1   a clust1
#2:  2    2   a clust1
#3:  2    3   c clust1
#4:  3    1   a clust2
#5:  3    2   b clust2
#6:  3    3   c clust2
#7:  4    1   a clust3
#8:  4    2   b clust3
#9:  4    3   c clust3

Upvotes: 3

Frank
Frank

Reputation: 66819

I think tidy data here would have an ID table where cluster is an attribute:

idDT = unique(dt[, .(id, clust)])


   id  clust
1:  1 clust1
2:  2 clust1
3:  3 clust2
4:  4 clust3
5:  5 clust3
6:  6 clust2

From there, sample...

my_selection = idDT[, .(id = sample(id, 1)), by=clust]

and merge or subset

dt[ my_selection, on=names(my_selection) ]
# or 
dt[ id %in% my_selection$id ]

I would keep the intermediate table my_selection around, expecting it to come in handy later.

Upvotes: 2

Related Questions