Replace NA values in a column in dataframe based on probability of occurrence with non-NA values

Question

I have to populate a set of 'Failure' values within a 'Bucket' randomly.

For instance,

| Bucket | Failure | Id |
|--------|---------|----|
| B1     | F1      | 1  |
| B1     | F2      | 2  |
| B1     | F1      | 3  |
| B1     | null    | 4  |
| B1     | null    | 5  |
| B2     | F3      | 6  |
| B2     | F4      | 7  |
| B2     | null    | 8  |

In table above, each Bucket can contain many records. Some of those records will contain Failure populated, but most will not. My goal is to randomly assign the Failure based on the proportion of Failures within a bucket. For instance, for combination - {B1, F1} as compared to the proportion of B1 records(with Failure populated) is 2/3 and for {B1, F2} the proportion of B1 records(with failure populated) is 1/3.

Therefore the records of B1 with null Failure column (Id=4,5) should get randomly either failure F1 or F2 but with the same proportion of F1 as 2/3 and F2 as 1/3. This logic needs to be applied for all buckets within the table.

I see that this is a complicated thing. I'm relatively a R noob, therefore, any code examples would be much appreciated.

In between, I see this question. But the solution doesn't run: Fill missing value based on probability of occurrence

See sample code:

test <- data.frame(
bucket = c(rep('B1', 5), rep('B2',3))
    , failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA)
    , Id = seq(1:8)
)

test

sample_fill_na = function(x) {
    x_na = is.na(x)
    x[x_na] = sample(x[!x_na], size = sum(x_na), replace = TRUE)
    return(x)
}

test[, failure := sample_fill_na(failure), by = bucket]

Florian · Accepted Answer

Here is a possible solution, we can use the probs argument in the sample function, and create weights (sample converts it to probabilities for us) using the table function.

Hope this helps!

library(data.table)
test <- data.frame(bucket = c(rep('B1', 5), rep('B2',3)), failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA), Id = seq(1:8))

fillF <- function(x){
  y <- table(x)
  x[is.na(x)] <- sample(names(y),sum(is.na(x)),prob =y,replace=T)
  return(x)
}

setDT(test)[, failure := fillF(failure), by = bucket]

output:

   bucket failure Id
1:     B1      F1  1
2:     B1      F2  2
3:     B1      F1  3
4:     B1      F1  4
5:     B1      F1  5
6:     B2      F3  6
7:     B2      F4  7
8:     B2      F3  8

We can easily check the proportions with the following code:

set.seed(1)
for(i in 1:9){test=rbind(test,test)}
setDT(test)[, failure := fillF(failure), by = bucket]
table(test$failure)

And indeed, the proportions look alright:

  F1   F2   F3   F4 
1705  855  749  787

EDIT: If there are empty groups in your data, we ave to decide on probbilities for ourselves. There are two logical options, fill always with a default (e.g. F1), or randomly sample from all options. So:

library(data.table)
test <- data.frame(bucket = c(rep('B1', 5), rep('B2',3),'B3'), failure = c('F1', 'F2', 'F1', NA, NA, 'F3', 'F4', NA,NA), Id = seq(1:9))

fillF <- function(x){
  y <- table(x)
  if(sum(y)<1){y=c('F1'=1)} # always F1
  # if(sum(y)<1){y=y+1} # randomly sample from all possibilities
  x[is.na(x)] <- sample(names(y),sum(is.na(x)),prob =y,replace=T)
  return(x)
}

setDT(test)[, failure := fillF(failure), by = bucket]

Replace NA values in a column in dataframe based on probability of occurrence with non-NA values

Answers (2)

Related Questions