Reputation: 6649
I have a large data frame with hundreds of thousands of rows, and I want to add a column whose value is sampled from a subset of another data frame, where the subset is chosen by matching the columns the two data frames have in common. It might be easier to explain with an example:
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)
I then want to add a new column sample to largeDF, which is a random sample of the sample column in sampleDF for the appropriate subset of colA and colB.
For example, for the first row the values are a and x, so the value will be a random sample of 1 or 2; for the next row (b and y) it will be a random sample of 6, 7, 8, 9 or 10.
So we could end up with something like:
  colA colB colC sample
1    a    x    1      2
2    b    y    2      9
3    b    y    3      7
4    a    x    4      2
5    a    y    5      4
6    b    y    6      8
Any help would be appreciated!
Upvotes: 1
Views: 219
Reputation: 11
I think this is one possible solution for you...
library(dplyr)
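largeDF_sample <- sapply(seq_len(nrow(largeDF)), function(x) {
  # Rows of sampleDF that match this row's colA and colB
  sampleDF_part <- filter(sampleDF, colA == largeDF$colA[x] & colB == largeDF$colB[x])
  # Pick by position so a single candidate is returned as-is
  # (assumes at least one matching row)
  sampleDF_part$sample[sample.int(nrow(sampleDF_part), 1)]
})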
largeDF$sample <- largeDF_sample
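One thing to watch with any approach that calls sample() on a subset: when the subset is a single number, sample() treats it as an upper bound rather than as the one value to return. In the example data the pair b / x has only the candidate 5. A minimal sketch of the issue and one way around it, where pick_one is a hypothetical helper name, not something from the question:

```r
x <- 5          # the only candidate for colA == 'b', colB == 'x'
sample(x, 1)    # draws from 1:5, not from the single value 5

# Hypothetical helper: sample a position, then index into the candidates.
# An empty 'vals' is not handled here.
pick_one <- function(vals) vals[sample.int(length(vals), 1)]

pick_one(5)        # always 5
pick_one(c(3, 4))  # 3 or 4
```

Sampling positions with sample.int() behaves the same way whether there is one candidate or many.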
Upvotes: 0
Reputation: 12935
You could do something like this:
largeDF$sample <- apply(largeDF, 1, function(a) {
  vals <- sampleDF$sample[sampleDF$colA == a["colA"] & sampleDF$colB == a["colB"]]
  vals[sample.int(length(vals), 1)]  # index by position: safe with one candidate
})
Upvotes: 1
Reputation: 18435
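Using dplyr... (As originally written with ==colA and ==colB, this threw a few warnings but appeared to work anyway, most likely because a group-length vector was being recycled against sampleDF. Comparing against colA[1] and colB[1], the single value shared by each group, avoids that.)
library(dplyr)
largeDF <- largeDF %>% group_by(colA, colB) %>%
  mutate(sample = sample(sampleDF$sample[sampleDF$colA == colA[1] & sampleDF$colB == colB[1]],
                         size = n(), replace = TRUE))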
largeDF
colA colB colC sample
<fctr> <fctr> <int> <int>
1 a x 1 2
2 b y 2 6
3 b y 3 9
4 a x 4 1
5 a y 5 4
6 b y 6 9
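Since the question mentions hundreds of thousands of rows, the same per-group idea can also be sketched in base R by building the lookup once and then drawing all values for a group in a single call. This is only a sketch under assumptions: the names pool, key and rows are made up for illustration, sampling is with replacement, and rows with no match are left as NA.

```r
set.seed(42)  # only so the sketch is reproducible

largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)

# Candidate values for every colA/colB combination, built once
pool <- split(sampleDF$sample, paste(sampleDF$colA, sampleDF$colB))

# Row indices of largeDF belonging to each combination
key  <- paste(largeDF$colA, largeDF$colB)
rows <- split(seq_len(nrow(largeDF)), key)

largeDF$sample <- NA_integer_
for (k in names(rows)) {
  vals <- pool[[k]]
  if (is.null(vals)) next  # no candidates for this combination: leave NA
  # Sample positions, not values, so a single candidate is handled correctly
  idx <- sample.int(length(vals), length(rows[[k]]), replace = TRUE)
  largeDF$sample[rows[[k]]] <- vals[idx]
}

largeDF
```

The loop runs once per distinct combination rather than once per row, so the cost of filtering sampleDF does not grow with the number of rows in largeDF.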
Upvotes: 1
Reputation: 3184
I do not quite understand the question, but it seems that you just want to add a new column to the large data frame holding values drawn from the "sample" column of sampleDF. See if the following code gives you an idea of the functionality you need (note that it samples from the whole column, without matching on colA and colB):
cbind.data.frame(largeDF, sample = sample(sampleDF$sample, nrow(largeDF)))
# colA colB colC sample
#1 a x 1 9
#2 b y 2 10
#3 b y 3 1
#4 a x 4 3
#5 a y 5 6
#6 b y 6 7
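To see whether a given result respects the grouping the question asks for, each colA / colB / sample triple can be checked against the triples that exist in sampleDF. A small sketch, where is_valid_row is a hypothetical helper name; the two inputs are the sample columns printed in this answer and in the question's example:

```r
largeDF <- data.frame(colA = c('a', 'b', 'b', 'a', 'a', 'b'),
                      colB = c('x', 'y', 'y', 'x', 'y', 'y'),
                      colC = 1:6)
sampleDF <- data.frame(colA = c('a','a','a','a','b','b','b','b','b','b'),
                       colB = c('x','x','y','y','x','y','y','y','y','y'),
                       sample = 1:10)

# Hypothetical helper: TRUE where the sampled value belongs to the row's
# colA/colB subset of sampleDF
is_valid_row <- function(df, samples) {
  paste(df$colA, df$colB, samples) %in%
    paste(sampleDF$colA, sampleDF$colB, sampleDF$sample)
}

ungrouped <- is_valid_row(largeDF, c(9, 10, 1, 3, 6, 7))  # output printed above
ungrouped
# FALSE  TRUE FALSE FALSE FALSE  TRUE

grouped <- is_valid_row(largeDF, c(2, 9, 7, 2, 4, 8))     # example in the question
grouped
# TRUE TRUE TRUE TRUE TRUE TRUE
```

Drawing from the whole column only matches the right subset by chance, which is why the per-group approaches restrict the candidates first.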
Upvotes: 0