Conditional change to data frame column(s) based on values in other columns

Question

Within the simulated data set

n <- 50
set.seed(378)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  smoker = sample(c("never", "former", "active"), n, replace = TRUE, prob = c(0.4, 0.45, 0.15)),
  py = abs(rnorm(n, 25, 10)),
  yrsquit = abs(rnorm(n, 10, 2)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

I need to introduce some imbalance between the outcome groups (1 = disease, 0 = no disease). For example, subjects with the disease should be older and more likely to be male. I tried

df1 <- within(df, sapply(length(outcome), function(x) {
if (outcome[x] == 1)  {
  age[x] <- age[x] + 15
  sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
}
}))

but there is no difference as shown by

tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)

Sven Hohenstein · Accepted Answer

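Using sapply inside within does not work as you expect. within evaluates its expression in an environment built from the data frame and keeps only the variables assigned in that environment. In your code, age[x] and sex[x] are assigned inside the anonymous function passed to sapply, so the changes are local to that function and are discarded when it returns. Hence, within does not modify the data frame. Note also that sapply(length(outcome), ...) calls the function only once, with x = 50, instead of once per row.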
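
The scoping behaviour can be checked on a toy example. This is a minimal sketch, assuming a hypothetical three-row data frame named toy (not the df from the question): the first call assigns inside a function passed to sapply, the second assigns directly in the within expression.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(x = 1:3, flag = c(0, 1, 0))

# The assignment happens inside the anonymous function, so it creates
# a local copy of x that within() never sees
unchanged <- within(toy, sapply(seq_along(x), function(i) {
  if (flag[i] == 1) x[i] <- x[i] + 100
}))

# The assignment happens directly in the within() expression, so it is kept
changed <- within(toy, x[flag == 1] <- x[flag == 1] + 100)

identical(unchanged$x, toy$x)  # TRUE: x was not modified
changed$x                      # 1 101 3
```

Only the second form changes the column, which is why the vectorised approach avoids the problem.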

Here is an easier way to modify the data frame without a loop or sapply:

idx <- df$outcome == "1"
df1 <- within(df, {
  age[idx] <- age[idx] + 15
  sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))
})

Now, the data frames are different:

> tapply(df$age, df$outcome, mean)
       0        1 
60.46341 57.55556 
> tapply(df1$age, df$outcome, mean)
       0        1 
60.46341 72.55556 

> tapply(df$sex, df$outcome, summary)
$`0`
 f  m 
24 17 

$`1`
f m 
2 7 

> tapply(df1$sex, df$outcome, summary)
$`0`
 f  m 
24 17 

$`1`
f m 
1 8
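
For comparison, the same conditional update can be written with plain logical indexing instead of within, and checked with table(), which reports counts per sex (tapply with length only counts rows per group). This is a rough sketch under stated assumptions: demo, demo1, the seed and n = 200 are made up for illustration, and sex is built with factor() explicitly because data.frame no longer converts strings to factors by default since R 4.0.0.

```r
set.seed(1)  # arbitrary seed, chosen only for this sketch
n <- 200     # hypothetical sample size

demo <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = factor(sample(c("m", "f"), n, replace = TRUE), levels = c("f", "m")),
  outcome = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Logical index of the rows with the disease
idx <- demo$outcome == "1"

# Modify a copy so the original stays available for comparison
demo1 <- demo
demo1$age[idx] <- demo1$age[idx] + 15
demo1$sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))

# Sex by outcome, before and after the change
table(demo$sex, demo$outcome)
table(demo1$sex, demo1$outcome)

# Mean age by outcome after the change
tapply(demo1$age, demo1$outcome, mean)
```

Rows where idx is FALSE are left untouched, so only the disease group shifts in age and sex distribution.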
