Phil

Reputation: 8107

R - Randomly selecting variables and manipulating them on a row-wise basis

I'm trying to go through each row of my data frame, randomly select half of the variables, and set those variables to NA for that row only.

For example, with the mydf dataset below, I'd like the first row to have 3 randomly selected variables (say QB, QE, QF) set to NA, then the second row to get a fresh random selection (say QA, QD, QE), and so forth:

library(tibble)
mydf <- tibble(QA = rnorm(100),
QB = rnorm(100), 
QC = rnorm(100), 
QD = rnorm(100), 
QE = rnorm(100), 
QF = rnorm(100))

My attempt, but it doesn't appear to do anything:

vars <- names(mydf)
for (i in nrow(mydf)){
  miss_vars <- sample(vars, 3)
  for (j in miss_vars) {
    mydf[i,j] <- NA
    # mydf[i,][[j]] <- NA  # Also tried this.
  }
}

Upvotes: 1

Views: 534

Answers (2)

989

Reputation: 12937

Try this vectorized approach:

m <- as.matrix(mydf)
n <- 3 # number of columns to set to NA in each row
inds <- cbind(rep(1:nrow(mydf), each=n), c(replicate(nrow(mydf), sample(ncol(mydf), n))))
m[inds] <- NA
res <- as.data.frame(m)

Here is how it works:

  1. Convert the data frame to a matrix, so that it can be indexed by a matrix of (row, column) pairs
  2. Define the number of columns to be selected randomly per row
  3. Build the matrix inds, which pairs each row number with each of that row's randomly sampled column numbers
  4. Set the cells at those row and column positions to NA
  5. Convert the matrix back to a data frame
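
Steps 3 and 4 rely on R's indexing of a matrix by a two-column matrix, where each row of the index is one (row, column) pair. A minimal sketch on a toy matrix may make that clearer; the names toy and idx are made up for this illustration and are not part of the code above:

```r
# Toy 3 x 4 matrix standing in for as.matrix(mydf) (illustrative only)
toy <- matrix(1:12, nrow = 3, ncol = 4)

# Each row of idx is one (row, column) pair to overwrite
idx <- cbind(c(1, 2, 3),   # row numbers
             c(4, 1, 2))   # column numbers

# Only the three addressed cells are replaced
toy[idx] <- NA
print(toy)
```

Only the cells at (1, 4), (2, 1) and (3, 2) become NA; everything else is left alone. In the answer's code, inds plays the role of idx, with n pairs per row.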

In res, you will have a data frame in which 3 randomly chosen columns are set to NA in each row. The output for the provided data frame is:

#            QA          QB          QC        QD         QE         QF
# 1  -0.6264538          NA          NA  1.358680 -0.1645236         NA
# 2   0.1836433          NA  0.78213630        NA -0.2533617         NA
# 3          NA          NA  0.07456498        NA  0.6969634  0.3411197
# 4          NA -2.21469989          NA        NA  0.5566632 -1.1293631
# 5          NA  1.12493092  0.61982575        NA         NA  1.4330237
# 6  -0.8204684 -0.04493361          NA        NA         NA  1.9803999
# 7   0.4874291 -0.01619026          NA -0.394290         NA         NA
# 8   0.7383247          NA -1.47075238        NA         NA -1.0441346
# 9          NA  0.82122120          NA  1.100025         NA  0.5697196
# 10         NA  0.59390132  0.41794156        NA         NA -0.1350546
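
To confirm that every row really ends up with exactly n missing values, one option is to count the NAs per row. The sketch below rebuilds the example so that it runs on its own; the name na_per_row is introduced here for illustration, and seq_len() is used in place of 1:nrow() as a small defensive habit:

```r
set.seed(1)
mydf <- data.frame(QA = rnorm(10), QB = rnorm(10), QC = rnorm(10),
                   QD = rnorm(10), QE = rnorm(10), QF = rnorm(10))

m <- as.matrix(mydf)
n <- 3  # number of columns to set to NA in each row

# Row numbers repeated n times each, paired with n sampled column numbers per row
inds <- cbind(rep(seq_len(nrow(m)), each = n),
              c(replicate(nrow(m), sample(ncol(m), n))))
m[inds] <- NA
res <- as.data.frame(m)

# Count missing values in each row; every count should equal n
na_per_row <- rowSums(is.na(res))
print(na_per_row)
```

Because sample() draws without replacement by default, the n columns picked for a row are distinct, so each row gets exactly n NAs.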

Data:

set.seed(1)
mydf <- data.frame(QA = rnorm(10),
QB = rnorm(10), 
QC = rnorm(10), 
QD = rnorm(10), 
QE = rnorm(10), 
QF = rnorm(10))

Upvotes: 1

Phil

Reputation: 8107

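The loop should have iterated over every row. for (i in nrow(mydf)) runs only once, with i equal to the last row number, so only that row was changed. It should have been: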

for (i in seq_len(nrow(mydf))){
  miss_vars <- sample(vars, 3)
  for (j in miss_vars) {
    mydf[i,][[j]] <- NA
  }
}
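
For anyone hitting the same problem, here is a small sketch of the difference between the two loop headers. The data frame df and the visited_* vectors are made up for this illustration:

```r
df <- data.frame(a = 1:4, b = 5:8)

# nrow(df) is the single number 4, so the loop body runs once
visited_wrong <- integer(0)
for (i in nrow(df)) visited_wrong <- c(visited_wrong, i)

# seq_len(nrow(df)) is the vector 1, 2, 3, 4, so the loop visits every row
visited_right <- integer(0)
for (i in seq_len(nrow(df))) visited_right <- c(visited_right, i)

print(visited_wrong)
print(visited_right)
```

seq_len() is also safer than 1:nrow(df), since it yields an empty sequence for a data frame with zero rows instead of counting down from 1 to 0.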

Upvotes: 1
