Reputation: 1250
I created a data frame with random values
n <- 50
df <- data.frame(id = seq (1:n),
age = sample(c(20:90), n, rep = TRUE),
sex = sample(c("m", "f"), n, rep = TRUE, prob = c(0.55, 0.45))
)
and would like to introduce a few NA values to simulate real-world data. I am trying to use apply but cannot get there. The line
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]})
will retrieve random values alright, but
apply(subset(df,select=-id), 2, function(x) {x[sample(c(1:n),floor(n/10))]<-NA})
will not set them to NA. I have also tried with() and within().
Brute force works:
for (i in (1:floor(n/10))) {
df[sample(c(1:n), 1), sample(c(2:ncol(df)), 1)] <- NA
}
But I'd prefer to use the apply family.
Upvotes: 11
Views: 10031
Reputation: 18595
Using dplyr (see note 1), you could arrive at the desired solution with the following compact syntax:
set.seed(123)
library("tidyverse")
n <- 50
df <- data.frame(
id = seq (1:n),
age = sample(c(20:90), n, replace = TRUE),
sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)
mutate(.data = as_tibble(df),
across(
.cols = all_of(c("age", "sex")),
.fns = ~ ifelse(row_number(.x) %in% sample(1:n(), size = (10 * n() / 100)), NA, .x)
))
Approximately 10% of the values in each column are replaced with NA. This follows from sample(1:n(), size = (10 * n() / 100)), which draws 10% of the row positions (5 of 50 here).
count(.Last.value, sex)
# A tibble: 3 x 2
# sex n
# <chr> <int>
# 1 f 21
# 2 m 24
# 3 NA 5
# A tibble: 50 x 3
# id age sex
# <int> <int> <chr>
# 1 1 50 m
# 2 2 70 m
Note 1: I'm loading tidyverse as replace_na is available via tidyr.
Upvotes: 3
Reputation: 52209
You can also use prodNA from the missForest package.
library(missForest)
library(dplyr)
> bind_cols(df[1],missForest::prodNA(df[-1],noNA=0.1))
# A tibble: 50 x 3
id age sex
<int> <int> <fct>
1 1 NA m
2 2 84 NA
3 3 82 f
4 4 42 f
5 5 35 m
6 6 80 m
7 7 90 f
8 8 NA NA
9 9 89 f
10 10 42 m
# … with 40 more rows
Upvotes: 2
Reputation: 53
To introduce a certain percentage of NAs into your data frame, you could use this:
while(sum(is.na(df) == TRUE) < (nrow(df) * ncol(df) * percentage/100)){
df[sample(nrow(df),1), sample(ncol(df),1)] <- NA
}
Here percentage is the share of cells (in percent) that should become NA. You could also change "(nrow(df) * ncol(df) * percentage/100)" to a fixed number of NAs.
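A self-contained sketch of the loop above, for anyone who wants to run it as is. The toy data frame, the seed, and the value of percentage are assumptions made for illustration. The target is rounded down with floor() so the final count is easy to predict.

```r
# Toy data and a hypothetical target percentage (both are assumptions).
set.seed(42)
df <- data.frame(id = 1:20, age = sample(20:90, 20, replace = TRUE))
percentage <- 10

# Number of cells that should end up NA.
target <- floor(nrow(df) * ncol(df) * percentage / 100)

# Keep blanking random cells until the target is reached. Hitting a cell
# that is already NA leaves the count unchanged, so the loop simply retries.
while (sum(is.na(df)) < target) {
  df[sample(nrow(df), 1), sample(ncol(df), 1)] <- NA
}

sum(is.na(df))
```

Note that, as in the loop above, every column is a candidate, including id. Restrict the column draw (for example to 2:ncol(df)) if some columns must stay complete.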
Upvotes: 2
Reputation: 890
Here is another simple way to go about it.
Your data frame:
df <- mtcars
Number of missing values required:
nbr_missing <- 20
Sample row and column indices:
y <- data.frame(row = sample(nrow(df), size = nbr_missing, replace = TRUE),
col = sample(ncol(df), size = nbr_missing, replace = TRUE))
Remove duplicates (so the final number of NAs can be smaller than nbr_missing):
y <- y[!duplicated(y), ]
Use matrix indexing:
df[as.matrix(y)] <- NA
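If exactly nbr_missing cells must become NA, one possible variation (not part of the approach above) is to sample distinct cell positions without replacement and convert them to row/column pairs with arrayInd(). A minimal sketch, assuming the same mtcars example; the seed is arbitrary:

```r
set.seed(1)
df <- mtcars
nbr_missing <- 20

# Draw distinct linear cell positions, so no duplicates need removing.
cells <- sample(nrow(df) * ncol(df), size = nbr_missing)

# Convert linear positions to (row, column) pairs for matrix indexing.
idx <- arrayInd(cells, .dim = dim(df))
df[idx] <- NA

sum(is.na(df))
```

Because the positions are drawn without replacement, the count of NAs equals nbr_missing, provided the frame had no missing values to begin with.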
Upvotes: 2
Reputation: 13334
Simply pass your dataframe into the following function. The only arguments are the frame you want to add NAs to and the number of features (columns) you want to have with NAs.
add_random_nas_to_frame <- function(frame, num_features) {
col_order <- names(frame)
rand_cols <- sample(ncol(frame), num_features)
left_overs <- which(!names(frame) %in% names(frame[,rand_cols]))
other_frame <- frame[,left_overs]
nas_added <- data.frame(lapply(frame[,rand_cols], function(x) x[sample(c(TRUE, NA), prob = c(sample(100, 1)/100, 0.15), size = length(x), replace = TRUE)]))
final_frame <- cbind(other_frame, nas_added)
final_frame <- final_frame[,col_order]
return(final_frame)
}
For example, using the full Bank Marketing dataset from UCI:
https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
bank <- read.table(file='path_to_data', sep =";", stringsAsFactors = F, header = T)
And viewing the original missing data:
We can see there is no missing data in the original frame.
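The check itself is not shown here. One common way to count missing values per column is colSums(is.na(...)). The sketch below uses a small made-up frame (bank_like is a hypothetical stand-in, not the real bank data) so that the result is easy to read:

```r
# Hypothetical stand-in for the real `bank` frame.
bank_like <- data.frame(
  age = c(30, 45, NA, 52),
  job = c("admin", NA, "services", "admin")
)

# Count of NA values in each column.
na_per_column <- colSums(is.na(bank_like))
na_per_column

# Single TRUE/FALSE answer: does the frame contain any NA at all?
anyNA(bank_like)
```

Running the same two calls on bank before and on bank_nas after applying the function gives a quick before/after comparison.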
Now applying our function:
bank_nas <- add_random_nas_to_frame(bank, 5)
Upvotes: 1
Reputation: 132874
apply() returns an array (here a matrix), thereby converting all columns to the same type. You could use this instead:
df[,-1] <- do.call(cbind.data.frame,
lapply(df[,-1], function(x) {
x[sample(c(1:n),floor(n/10))]<-NA
x
})
)
Or use a for loop:
for (i in seq_along(df[,-1])+1) {
is.na(df[sample(seq_len(n), floor(n/10)),i]) <- TRUE
}
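To see the type issue concretely, here is a small sketch that compares the two approaches on the data frame from the question. The seed is arbitrary and the identity function passed to apply() is only there to expose the coercion:

```r
set.seed(123)
n <- 50
df <- data.frame(
  id  = seq_len(n),
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45))
)

# apply() first converts the data frame to a matrix; mixed numeric and
# character columns become a single character matrix.
via_apply <- apply(df[, -1], 2, function(x) x)
class(via_apply[, "age"])

# lapply() works column by column, so each column keeps its own type.
df[-1] <- lapply(df[-1], function(x) {
  x[sample(n, floor(n / 10))] <- NA
  x
})
sapply(df, class)
colSums(is.na(df))
```

With lapply(), age stays integer and each non-id column ends up with floor(n/10) missing values.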
Upvotes: 4
Reputation: 226557
I think you need to return the x value from the function:
apply(subset(df,select=-id), 2, function(x)
{x[sample(c(1:n),floor(n/10))]<-NA; x})
but you also need to assign this back to the relevant subset of the data frame (and subset(...) <- ... doesn't work):
idCol <- names(df)=="id"
df[,!idCol] <- apply(df[,!idCol], 2, function(x)
{x[sample(1:n,floor(n/10))] <- NA; x})
(If you have only a single non-ID column, you'll need df[,!idCol,drop=FALSE].)
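A short sketch of why drop=FALSE matters in the single-column case. The frame df1 and its values are made up for illustration:

```r
# Hypothetical frame with only one non-ID column.
df1 <- data.frame(id = 1:5, age = c(23L, 35L, 41L, 58L, 62L))
idCol <- names(df1) == "id"

# With one remaining column, `[` drops the result to a plain vector ...
dropped <- df1[, !idCol]
is.data.frame(dropped)

# ... while drop = FALSE keeps a one-column data frame, which apply() needs
# because it requires an object with dimensions.
kept <- df1[, !idCol, drop = FALSE]
is.data.frame(kept)

df1[, !idCol] <- apply(kept, 2, function(x) {
  x[sample(length(x), 1)] <- NA
  x
})
sum(is.na(df1$age))
```

Keep in mind that apply() still goes through a matrix, so with mixed column types the assigned-back columns all share one type.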
Upvotes: 2
Reputation: 54247
Return x within your function:
> df <- apply (df, 2, function(x) {x[sample( c(1:n), floor(n/10))] <- NA; x} )
> tail(df)
id age sex
[45,] "45" "41" NA
[46,] "46" NA "f"
[47,] "47" "38" "f"
[48,] "48" "32" "f"
[49,] "49" "53" NA
[50,] "50" "74" "f"
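The reason the trailing x matters: an R function returns the value of its last expression, and an assignment evaluates to the assigned value (NA here), not to the modified vector. A minimal sketch with made-up function names:

```r
x <- c(10, 20, 30, 40)

# The last expression is an assignment, whose value is the right-hand side,
# so the function returns NA rather than the modified vector.
no_return <- function(x) { x[2] <- NA }

# Ending with `x` returns the modified copy.
with_return <- function(x) { x[2] <- NA; x }

r1 <- no_return(x)
r2 <- with_return(x)
r1
r2
```

In both cases the original x outside the function is unchanged, which is why the result also has to be assigned back to df.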
Upvotes: 6