K Owen

Reputation: 1250

How do I add random `NA`s into a data frame?

I created a data frame with random values

n <- 50
df <- data.frame(id = seq (1:n),
age = sample(c(20:90), n, rep = TRUE), 
sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45))
)

and would like to introduce a few NA values to simulate real-world data. I am trying to use apply but cannot get it to work. The line

apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})

will retrieve random values alright, but

apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA}) 

will not set them to NA. I have also tried with and within.

Brute force works:

for (i in (1:floor(n/10))) {
  df[sample(c(1:n), 1), sample(c(2:ncol(df)), 1)] <- NA
  }

But I'd prefer to use the apply family.

Upvotes: 11

Views: 10031

Answers (8)

Konrad

Reputation: 18595

Using dplyr [1], you could arrive at the desired result with the following compact syntax:

set.seed(123)
library("tidyverse")
n <- 50
df <- data.frame(
  id = seq (1:n),
  age = sample(c(20:90), n, replace  = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)
mutate(.data = as_tibble(df),
       across(
         .cols = all_of(c("age", "sex")),
         .fns = ~ ifelse(row_number(.x) %in% sample(1:n(), size = 10 * n() / 100),
                         NA, .x)
       ))

Results

Approximately 10% of the values in each column are replaced with NA. This follows from sample(1:n(), size = (10 * n() / 100))

count(.Last.value, sex)
#   A tibble: 3 x 2
#   sex       n
#   <chr> <int>
# 1 f        21
# 2 m        24
# 3 NA        5

#  A tibble: 50 x 3
#      id   age sex  
#   <int> <int> <chr>
# 1     1    50 m    
# 2     2    70 m  

[1] I'm loading tidyverse as replace_na is available via tidyr.

Upvotes: 3

Maël

Reputation: 52209

You can also use prodNA from the missForest package.

library(missForest)
library(dplyr)

> bind_cols(df[1],missForest::prodNA(df[-1],noNA=0.1))

# A tibble: 50 x 3
      id   age sex  
   <int> <int> <fct>
 1     1    NA m    
 2     2    84 NA   
 3     3    82 f    
 4     4    42 f    
 5     5    35 m    
 6     6    80 m    
 7     7    90 f    
 8     8    NA NA   
 9     9    89 f    
10    10    42 m    
# … with 40 more rows

Upvotes: 2

Derek van Tilborg

Reputation: 53

To introduce a certain percentage of NAs into your data frame, you could use this:

percentage <- 10  # example value: percent of cells to set to NA
while (sum(is.na(df)) < nrow(df) * ncol(df) * percentage / 100) {
  df[sample(nrow(df), 1), sample(ncol(df), 1)] <- NA
}

You could also change "nrow(df) * ncol(df) * percentage / 100" to a fixed number of NAs.

Upvotes: 2

Adam Lee Perelman

Reputation: 890

Here is another simple way to go about it.

Your data frame:

df <- mtcars

Number of missing values required:

nbr_missing <- 20

Sample row and column indices:

y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
                col = sample(ncol(df), size = nbr_missing, replace = TRUE))

Remove duplicates (so the final number of NAs may end up slightly below nbr_missing):

y <- y[!duplicated(y), ]

Use matrix indexing:

df[as.matrix(y)] <- NA
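
If an exact number of NAs is needed, one possible variant of the same idea (a sketch, not part of the answer above) is to sample linear cell positions without replacement and convert them to row/column pairs with base R's arrayInd(). The names idx and cells are only illustrative.

```r
df <- mtcars
nbr_missing <- 20

# Sample distinct cell positions (linear indices), so duplicates cannot occur.
idx <- sample(nrow(df) * ncol(df), nbr_missing)

# Convert the linear indices to (row, col) pairs for matrix indexing.
cells <- arrayInd(idx, dim(df))

df[cells] <- NA

# Count the NAs that were introduced.
sum(is.na(df))
```

Because the positions are sampled without replacement, no de-duplication step is needed.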

Upvotes: 2

Cybernetic

Reputation: 13334

Simply pass your dataframe into the following function. The only arguments are the frame you want to add NAs to and the number of features (columns) you want to have with NAs.

add_random_nas_to_frame <- function(frame, num_features) {
   col_order <- names(frame) 
   rand_cols <- sample(ncol(frame), num_features)
   left_overs <- which(!names(frame) %in% names(frame[,rand_cols]))
   other_frame <- frame[,left_overs]
   nas_added <- data.frame(lapply(frame[,rand_cols], function(x) x[sample(c(TRUE, NA), prob = c(sample(100, 1)/100, 0.15), size = length(x), replace = TRUE)]))
   final_frame <- cbind(other_frame, nas_added)
   final_frame <- final_frame[,col_order]
   return(final_frame)
}

For example, using the full Bank Marketing dataset from UCI:

https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

bank <- read.table(file='path_to_data', sep =";", stringsAsFactors = F, header = T)

Viewing the missing data in the original frame shows that there is none.

Now applying our function:

bank_nas <- add_random_nas_to_frame(bank, 5)

Upvotes: 1

Roland

Reputation: 132874

apply coerces the data frame to a matrix and returns an array, thereby converting all columns to the same type (here, character). You could use this instead:

df[,-1] <- do.call(cbind.data.frame, 
                   lapply(df[,-1], function(x) {
                     x[sample(c(1:n),floor(n/10))]<-NA
                     x
                   })
                   )

Or use a for loop:

for (i in seq_along(df[,-1])+1) {
  is.na(df[sample(seq_len(n), floor(n/10)),i]) <- TRUE
}
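
A slightly shorter way to write the lapply version, as a minimal sketch: assign the list returned by lapply() straight back into the non-id columns. The helper name add_na and the seed are made up for illustration; the data frame is the one from the question.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 50
df <- data.frame(
  id  = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

# Hypothetical helper: set k randomly chosen entries of a vector to NA
# and return the modified vector.
add_na <- function(x, k = floor(length(x) / 10)) {
  x[sample(seq_along(x), k)] <- NA
  x
}

# lapply() works column by column, so each column keeps its own type.
df[-1] <- lapply(df[-1], add_na)

colSums(is.na(df))  # NAs per column
sapply(df, class)   # column types after the replacement
```

The id column is untouched because only df[-1] is passed to lapply() and assigned back.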

Upvotes: 4

Ben Bolker

Reputation: 226557

I think you need to return the x value from the function:

apply(subset(df,select=-id), 2, function(x) 
     {x[sample(c(1:n),floor(n/10))]<-NA; x}) 

but you also need to assign this back to the relevant subset of the data frame (and subset(...) <- ... doesn't work)

idCol <- names(df)=="id"
df[,!idCol] <- apply(df[,!idCol], 2, function(x) 
     {x[sample(1:n,floor(n/10))] <- NA; x})

(if you have only a single non-ID column you'll need df[,!idCol,drop=FALSE])
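
One thing worth checking with this approach: apply() converts the data frame to a matrix first, so with a character column present the numeric column should come back as character too. The following sketch, using the question's data frame, records the class after the assignment and converts age back; the variable age_class_after_apply is only there for illustration.

```r
n <- 50
df <- data.frame(
  id  = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

idCol <- names(df) == "id"
df[, !idCol] <- apply(df[, !idCol], 2, function(x) {
  x[sample(n, floor(n / 10))] <- NA
  x
})

# Record the column type after the round trip through apply().
age_class_after_apply <- class(df$age)

# Convert age back to integer; the NAs are preserved.
df$age <- as.integer(df$age)
```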

Upvotes: 2

lukeA

Reputation: 54247

Return x within your function:

> df <- apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )
> tail(df)
      id   age  sex
[45,] "45" "41" NA 
[46,] "46" NA   "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA 
[50,] "50" "74" "f"

Upvotes: 6
