Reputation: 4997
I have a large dataset, x, that contains replicated values; some rows are duplicated across all of its variables:
set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
x_unique <- x[!duplicated(x),]
I need to sample all instances of each unique row in x a given number of times, so I create a new variable that is simply a concatenation of the variables for each row:
# Another way of seeing x is as a single value - will be useful later
x_code <- do.call(paste0, x)
u_code <- x_code[!duplicated(x)]
We need a repeated sample from x, replicating each unique row a given number of times. This information is provided in the vector s:
s <- rpois(n = nrow(x_unique), lambda = 0.9)
The question is: how do we sample individuals from x to reach the quota set by s for each unique row? Here's a long and clumsy way that gets the right result:
sel <- integer(0)
for(i in seq_along(s)){
  xs <- which(x_code %in% u_code[i])
  sel <- c(sel, xs[sample(length(xs), size = s[i], replace = TRUE)])
}
x_sampled <- x[sel, ]
This is slow to run and cumbersome to write.
Is there a way to generate the same result (x_sampled in the above) faster and more concisely? Surely there must be a way!
Upvotes: 2
Views: 190
Reputation: 103948
The key to doing this efficiently is to figure out how to work with the indices, and how to vectorise as much as possible. For your problem, things get much easier if you find the indices for each repeated row:
set.seed(40)
x <- data.frame(matrix(round(runif(1000)), ncol = 10))
index <- 1:nrow(x)
grouped_index <- split(index, x, drop = TRUE)
names(grouped_index) <- NULL
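Each element of grouped_index now holds the row indices of x that make up one group of identical rows, so as a quick sanity check the group sizes should add back up to the number of rows:
lengths(grouped_index)       # size of each group of identical rows
sum(lengths(grouped_index))  # adds back up to nrow(x), i.e. 100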
Then you can use Map() to combine the indices to sample from and the number of samples to take for each group. I write a wrapper around sample() to protect against the annoying behaviour when x is of length 1.
sample2 <- function(x, n, ...) {
  if (length(x) == 1) return(rep(x, n))
  sample(x, n, ...)
}
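# Aside: the wrapper matters because sample() treats a single number n >= 1 as 1:n,
# so a group whose only row index is, say, 10 would otherwise yield random values
# from 1:10. Compare (not run, so the RNG stream and the output below stay reproducible):
# sample(10, 3, replace = TRUE)   # random draws from 1:10
# sample2(10, 3, replace = TRUE)  # always 10 10 10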
samples <- rpois(n = length(grouped_index), lambda = 0.9)
sel <- unlist(Map(sample2, grouped_index, samples, replace = TRUE))
sel
#> [1] 66 99 99 2 6 31 90 25 42 57 14 14 8 8 12 77 60
#> [18] 17 17 92 76 76 76 70 95 36 36 36 100 91 41 41 28 69
#> [35] 69 54 54 54 54 81 64 96 35 39 29 11 74 93 82 82 24
#> [52] 46 48 48 48 51 51 73 20 37 71 71 58 16 68 94 94 94
#> [69] 80 80 80 13 13 87 87 67 67 86 49 49 88 88 52 75 47
#> [86] 89 7 79 63 78 72 72 19
If you want to keep the original row order, use sort():
sort(sel)
#> [1] 2 6 7 8 8 11 12 13 13 14 14 16 17 17 19 20 24
#> [18] 25 28 29 31 35 36 36 36 37 39 41 41 42 46 47 48 48
#> [35] 48 49 49 51 51 52 54 54 54 54 57 58 60 63 64 66 67
#> [52] 67 68 69 69 70 71 71 72 72 73 74 75 76 76 76 77 78
#> [69] 79 80 80 80 81 82 82 86 87 87 88 88 89 90 91 92 93
#> [86] 94 94 94 95 96 99 99 100
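To get the sampled data frame (x_sampled in the question), subset x with these indices:
x_sampled <- x[sel, ]   # or x[sort(sel), ] to keep the original row order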
I think the bottleneck in this code will be split(): base R doesn't have an efficient way of hashing data frames, so it relies on pasting the columns together.
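If that ever becomes a problem, one possible workaround (just a sketch, reusing the x_code key already built in the question) is to split on the pasted key directly, so split() never has to handle a data frame. For 0/1 data like this the pasted key is unambiguous, and the groups are the same, though they may come back in a different order:
x_code <- do.call(paste0, x)                      # pasted row key from the question
grouped_index2 <- split(seq_len(nrow(x)), x_code)
names(grouped_index2) <- NULL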
Upvotes: 2
Reputation: 179558
You can use rep() to create an index vector, followed by subsetting your data using this index vector. Try this:
idx <- rep(1:length(s), times=s)
Here are the first few values of idx; note how the second row gets repeated twice, while row 4 is absent:
idx
[1] 1 2 2 3 6 7 8 10 11 13 14 14 ......
Then do the subsetting. Notice how the new duplicates have row names that indicate the replication.
x_unique[idx, ]
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 1 0 0 0 1 0 0 1 0
2 1 0 1 0 0 1 0 0 0 0
2.1 1 0 1 0 0 1 0 0 0 0
3 1 1 0 0 1 0 0 0 1 0
6 0 0 0 0 1 1 0 0 0 0
7 0 1 1 0 1 1 0 1 1 1
8 1 1 0 1 0 0 1 1 0 0
10 0 0 1 0 1 1 1 1 0 0
....
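If you'd rather not keep those replication suffixes in the row names, you can reset them after subsetting:
x_sampled <- x_unique[idx, ]
rownames(x_sampled) <- NULL   # drop the "2.1"-style suffixes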
Upvotes: 1