Reputation: 409
I have to populate a set of 'Failure' values within a 'Bucket' randomly.
For instance,
| Bucket | Failure | Id |
|--------|---------|----|
| B1 | F1 | 1 |
| B1 | F2 | 2 |
| B1 | F1 | 3 |
| B1 | null | 4 |
| B1 | null | 5 |
| B2 | F3 | 6 |
| B2 | F4 | 7 |
| B2 | null | 8 |
In table above, each Bucket can contain many records. Some of those records will contain Failure populated, but most will not. My goal is to randomly assign the Failure based on the proportion of Failures within a bucket. For instance, for combination - {B1, F1} as compared to the proportion of B1 records(with Failure populated) is 2/3 and for {B1, F2} the proportion of B1 records(with failure populated) is 1/3.
Therefore the records of B1 with null Failure column (Id=4,5) should get randomly either failure F1 or F2 but with the same proportion of F1 as 2/3 and F2 as 1/3. This logic needs to be applied for all buckets within the table.
I see that this is a complicated thing. I'm relatively a R noob, therefore, any code examples would be much appreciated.
In between, I see this question. But the solution doesn't run: Fill missing value based on probability of occurrence
See sample code:
test <- data.frame(
bucket = c(rep('B1', 5), rep('B2',3))
, failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA)
, Id = seq(1:8)
)
test
sample_fill_na = function(x) {
x_na = is.na(x)
x[x_na] = sample(x[!x_na], size = sum(x_na), replace = TRUE)
return(x)
}
test[, failure := sample_fill_na(failure), by = bucket]
Upvotes: 2
Views: 1135
Reputation: 25395
Here is a possible solution, we can use the probs
argument in the sample
function, and create weights (sample
converts it to probabilities for us) using the table
function.
Hope this helps!
library(data.table)
test <- data.frame(bucket = c(rep('B1', 5), rep('B2',3)), failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA), Id = seq(1:8))
fillF <- function(x){
y <- table(x)
x[is.na(x)] <- sample(names(y),sum(is.na(x)),prob =y,replace=T)
return(x)
}
setDT(test)[, failure := fillF(failure), by = bucket]
output:
bucket failure Id
1: B1 F1 1
2: B1 F2 2
3: B1 F1 3
4: B1 F1 4
5: B1 F1 5
6: B2 F3 6
7: B2 F4 7
8: B2 F3 8
We can easily check the proportions with the following code:
set.seed(1)
for(i in 1:9){test=rbind(test,test)}
setDT(test)[, failure := fillF(failure), by = bucket]
table(test$failure)
And indeed, the proportions look alright:
F1 F2 F3 F4
1705 855 749 787
EDIT: If there are empty groups in your data, we ave to decide on probbilities for ourselves. There are two logical options, fill always with a default (e.g. F1
), or randomly sample from all options. So:
library(data.table)
test <- data.frame(bucket = c(rep('B1', 5), rep('B2',3),'B3'), failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA,NA), Id = seq(1:9))
fillF <- function(x){
y <- table(x)
if(sum(y)<1){y=c('F1'=1)} # always F1
# if(sum(y)<1){y=y+1} # randomly sample from all possibilities
x[is.na(x)] <- sample(names(y),sum(is.na(x)),prob =y,replace=T)
return(x)
}
setDT(test)[, failure := fillF(failure), by = bucket]
Upvotes: 1
Reputation: 103
Here is a potential solution, it basically sample the failures with the probability given by their proportion in each bucket.
test <- data.frame(
bucket = c(rep('B1', 5), rep('B2',3))
, failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA)
, Id = seq(1:8)
)
fillNA <- function(test) {
uni <- unique(test$bucket)
for (i in 1:length(uni)) {
pos <- test$bucket == uni[i]
tbl <- table(test[pos, 2])
proportions <- tbl / sum(tbl)
posNA <- pos & is.na(test[, 2])
test[posNA, 2] <- sample(levels(test[, 2]),
sum(posNA),
replace = TRUE,
prob = proportions)
}
return(test)
}
fillNA(test)
Result:
bucket failure Id
1 B1 F1 1
2 B1 F2 2
3 B1 F1 3
4 B1 F1 4
5 B1 F1 5
6 B2 F3 6
7 B2 F4 7
8 B2 F4 8
Upvotes: 1