R: Create Data Partition with extra term

Question

I have the following data.frame (which is longer than this example):

sub height  group
1   1.55    a
2   1.65    a
3   1.76    b
4   1.77    a
5   1.58    c
6   1.65    d
7   1.82    c
8   1.91    c
9   1.77    b
10  1.69    b
11  1.74    a
12  1.75    c

I'm making a data partition with the following code:

library("caret")
# createDataPartition returns a list by default, so take its first element
train <- createDataPartition(df$group, p = 0.50)[[1]]
partition <- df[train, ]

So it takes subjects from each group with a probability of 0.5. My problem in this example is that sometimes a subject from group d will be picked and sometimes not (because group d is really small). I want to add a constraint so that every partition I make contains at least 1 subject from EVERY group.

Any graceful solution?

I came up with a not-so-graceful solution that looks like this:

allGroupSamples <- c()
for (i in unique(df$group))
{
  # pick one row name at random from each group
  allGroupSamples <- c(allGroupSamples, sample(rownames(df[df$group == i, ]), 1))
}
# assumes the default integer row names
allGroupSamples <- as.integer(allGroupSamples)

train <- createDataPartition(df$group, p = 0.50)[[1]]
train <- c(allGroupSamples, train)

partition <- df[unique(train), ]

Zelazny7 · Accepted Answer

You can use split on a data.frame and sample within each group taking half of the records or 1, whichever is greater:

# apply a function over the split data.frame
samples <- lapply(split(df, df$group), function(x) {

  # the function takes a random sample of half the records in each group
  # by using `ceiling`, it guarantees at least one record
  s <- sample(nrow(x), ceiling(nrow(x)/2))
  x[s,]
})

train <- do.call(rbind, samples)
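
To check that every group ends up in the sample, the same approach can be run on the example data from the question. This is only a minimal sketch: the data frame is rebuilt by hand from the table above, and the seed is arbitrary and only there for reproducibility.

```r
# Rebuild the example data from the question
df <- data.frame(
  sub    = 1:12,
  height = c(1.55, 1.65, 1.76, 1.77, 1.58, 1.65,
             1.82, 1.91, 1.77, 1.69, 1.74, 1.75),
  group  = c("a", "a", "b", "a", "c", "d",
             "c", "c", "b", "b", "a", "c")
)

set.seed(1)  # arbitrary seed, only for reproducibility

# Sample half of each group, rounded up, so no group is left empty
samples <- lapply(split(df, df$group), function(x) {
  x[sample(nrow(x), ceiling(nrow(x) / 2)), ]
})
train <- do.call(rbind, samples)

# Number of sampled rows per group
table(train$group)
```

Which rows are drawn depends on the seed, but the group sizes do not: groups a, b and c contribute 2 rows each and the single-row group d always contributes its one row.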

Edit:

If you need a numeric vector:

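s <- tapply(seq_len(nrow(df)), df$group, function(x) {
  # index into x instead of calling sample(x, ...): when a group has a
  # single row, sample(x, 1) would draw from 1:x rather than return x
  x[sample.int(length(x), ceiling(length(x)/2))]
})

do.call(c, s)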
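
One thing to watch when sampling row indices directly: for a single number `x`, `sample(x, 1)` draws from `1:x` rather than returning `x`, which matters for a one-row group such as d. The index version could be wrapped in a small helper along the following lines. This is a sketch, not part of caret: the name `stratified_indices` and its `p` argument are made up for illustration.

```r
# Hypothetical helper: returns row indices, sampling a fraction p of each
# group but never fewer than one row per group.
stratified_indices <- function(groups, p = 0.5) {
  # drop = TRUE skips unused factor levels, which have nothing to sample
  idx_by_group <- split(seq_along(groups), groups, drop = TRUE)
  picked <- lapply(idx_by_group, function(idx) {
    n <- max(1, ceiling(length(idx) * p))
    # Index into idx rather than calling sample(idx, n): for a single
    # number, sample(idx, n) would draw from 1:idx instead.
    idx[sample.int(length(idx), n)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Group labels from the question's example data
groups <- c("a", "a", "b", "a", "c", "d", "c", "c", "b", "b", "a", "c")
train_idx <- stratified_indices(groups, p = 0.5)

# Groups of the selected rows; every group appears at least once
groups[train_idx]
```

With the data frame from the question, `df[stratified_indices(df$group), ]` would then give the partition, and row 6 (the only member of group d) is always included.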
