user1165199

Reputation: 6649

Get random sample from subset of other dataframe

I have a large data frame with hundreds of thousands of rows, and I want to add a column whose value is sampled from a subset of another data frame, where the subset is chosen by matching columns the two data frames have in common. It might be easier to explain with an example:

largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)

sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)

I then want to add a new column sample to largeDF, which is a random sample of the sample column in sampleDF for the appropriate subset of colA and colB.

For example, in the first row the values are a and x, so the new value will be randomly drawn from 1 or 2. In the next row (b and y) it will be randomly drawn from 6, 7, 8, 9 or 10.

So we could end up with something like:

  colA colB colC sample
1    a    x    1      2
2    b    y    2      9
3    b    y    3      7
4    a    x    4      2
5    a    y    5      4
6    b    y    6      8

Any help would be appreciated!

Upvotes: 1

Views: 219

Answers (4)

NiCl2

Reputation: 11

I think this is one possible solution for you...

library(dplyr)
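largeDF_sample <- sapply(seq_len(nrow(largeDF)), function(x) {
    # rows of sampleDF that share this row's colA and colB
    sampleDF_part <- filter(sampleDF, colA == largeDF$colA[x] & colB == largeDF$colB[x])
    # index into the pool so that a single-value pool is returned as-is
    sampleDF_part$sample[sample.int(nrow(sampleDF_part), 1)]
})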
largeDF$sample <- largeDF_sample
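
One caveat applies to any per-row approach that passes the matched values straight to sample(): when its first argument is a single number of 1 or more, base R samples from 1:x rather than returning that number. In the question's data the combination b and x has only the value 5, so this can matter. The sketch below illustrates the difference; the names pool, draws and safe are made up for the example.

```r
# The only candidate value for colA == 'b', colB == 'x' in sampleDF.
pool <- 5L

# sample() treats a single number x >= 1 as "sample from 1:x",
# so these draws are spread over 1..5 instead of always being 5.
draws <- replicate(1000, sample(pool, 1))

# Sampling an index and then subsetting always stays inside the pool.
safe <- replicate(1000, pool[sample.int(length(pool), 1)])

table(draws)
unique(safe)
```

Indexing with sample.int() behaves the same for pools of any length, which is why it is a common way to sidestep this special case.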

Upvotes: 0

989

Reputation: 12935

You could do something like this:

largeDF$sample <- apply(largeDF, 1, function(a) {
  pool <- sampleDF$sample[sampleDF$colA == a[1] & sampleDF$colB == a[2]]
  pool[sample.int(length(pool), 1)]  # index the pool so a single value is returned as-is
})

Upvotes: 1

Andrew Gustar

Reputation: 18435

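Using dplyr. Within each group, compare sampleDF against the first element of colA and colB only; comparing against the whole group column recycles vectors of different lengths, which triggers warnings and can go wrong once a group has more rows than sampleDF.

library(dplyr)

largeDF <- largeDF %>% group_by(colA, colB) %>% 
            mutate(sample = sample(sampleDF$sample[sampleDF$colA == colA[1] & sampleDF$colB == colB[1]],
                   size = n(), replace = TRUE))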

largeDF

    colA   colB  colC sample
  <fctr> <fctr> <int>  <int>
1      a      x     1      2
2      b      y     2      6
3      b      y     3      9
4      a      x     4      1
5      a      y     5      4
6      b      y     6      9
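
For comparison, here is a base R sketch of the same group-wise idea: build one pool of candidate values per combination of colA and colB, then draw as many values (with replacement) as largeDF has rows for that combination. The names sample_key, large_key, pools, rows and drawn are made up for illustration, and the sketch assumes the character "|" never appears in colA or colB. Combinations that are missing from sampleDF are left as NA.

```r
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)

sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)

# One string key per (colA, colB) pair; assumes "|" does not occur in the values.
sample_key <- paste(sampleDF$colA, sampleDF$colB, sep = "|")
large_key  <- paste(largeDF$colA, largeDF$colB, sep = "|")

# Candidate values per combination, and largeDF row numbers per combination.
pools <- split(sampleDF$sample, sample_key)
rows  <- split(seq_len(nrow(largeDF)), large_key)

drawn <- rep(NA, nrow(largeDF))
for (k in names(rows)) {
  pool <- pools[[k]]  # NULL when the combination is absent from sampleDF
  if (length(pool) > 0) {
    # One draw per matching row, with replacement, by indexing into the pool.
    picks <- sample.int(length(pool), length(rows[[k]]), replace = TRUE)
    drawn[rows[[k]]] <- pool[picks]
  }
}
largeDF$sample <- drawn

largeDF
```

The loop runs once per distinct combination rather than once per row, so it should stay manageable for a data frame with hundreds of thousands of rows as long as the number of combinations is modest.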

Upvotes: 1

Evan Friedland

Reputation: 3184

I do not quite understand the question, but it seems that you are just adding a new column to the large data frame that is sampled from the "sample" column of the other data frame. See if the following code gives you an idea of the functionality you need (note that it samples from the whole column and does not match on colA and colB):

cbind.data.frame(largeDF, sample = sample(sampleDF$sample, nrow(largeDF)))
#  colA colB colC sample
#1    a    x    1      9
#2    b    y    2     10
#3    b    y    3      1
#4    a    x    4      3
#5    a    y    5      6
#6    b    y    6      7

Upvotes: 0
