K Owen

Reputation: 1250

Conditional change to data frame column(s) based on values in other columns

Within the simulated data set

n <- 50
set.seed(378)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  smoker = sample(c("never", "former", "active"), n, replace = TRUE, prob = c(0.4, 0.45, 0.15)),
  py = abs(rnorm(n, 25, 10)),
  yrsquit = abs(rnorm(n, 10, 2)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

I need to introduce some imbalance between the outcome groups (1 = disease, 0 = no disease). For example, subjects with the disease should be older and more likely to be male. I tried:

df1 <- within(df, sapply(length(outcome), function(x) {
if (outcome[x] == 1)  {
  age[x] <- age[x] + 15
  sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
}
}))

but the result is no different from the original, as shown by:

tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)

Upvotes: 0

Views: 762

Answers (1)

Sven Hohenstein

Reputation: 81693

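The use of sapply inside within does not work as you expect. The function within evaluates its expression and keeps only the variables that were assigned directly in that evaluation environment; the value returned by sapply is discarded. In your code, the assignments to age[x] and sex[x] happen inside the anonymous function, so they change local copies that disappear when the function returns. In addition, sapply(length(outcome), ...) calls the function once with x = 50 instead of once per row. Hence, within does not modify the data frame.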
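A minimal sketch of why assignments made inside a function passed to sapply do not reach within, while assignments written directly in the within expression do. The data frame `d` and its columns `a` and `flag` are made up for this demo and are not part of the question's data.

```r
# Toy data frame; `d`, `a`, and `flag` are hypothetical names for this demo.
d <- data.frame(a = c(1, 2, 3), flag = c(0, 1, 1))

# The assignment inside the anonymous function only changes a local copy
# of `a`, so within() finds no modified variable to write back.
d_fun <- within(d, sapply(seq_along(flag), function(x) {
  if (flag[x] == 1) a[x] <- a[x] + 15
}))

# An assignment made directly in the within() expression is picked up.
d_top <- within(d, a[flag == 1] <- a[flag == 1] + 15)

identical(d_fun$a, d$a)  # TRUE: column unchanged
d_top$a                  # 1 17 18
```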

Here is an easier way to modify the data frame without a loop or sapply:

idx <- df$outcome == "1"
df1 <- within(df, {age[idx] <- age[idx] + 15; 
                   sex[idx] <- sample(c("m", "f"), sum(idx), 
                                      replace = TRUE, prob = c(0.8, 0.2))})
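For comparison, the same kind of change can be sketched with plain indexed assignment instead of within. This is an alternative with the same effect, not the code above; the name `df2` and the reduced set of columns are choices made here to keep the example short, so the simulated values will not match the full data set from the question.

```r
set.seed(378)
n <- 50
# Reduced version of the question's data (only the columns used here).
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  outcome = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2))),
  stringsAsFactors = FALSE
)

idx <- df$outcome == "1"           # TRUE for rows with the disease

df2 <- df                          # work on a copy; `df2` is a hypothetical name
df2$age[idx] <- df2$age[idx] + 15  # shift age for diseased rows only
df2$sex[idx] <- sample(c("m", "f"), sum(idx),
                       replace = TRUE, prob = c(0.8, 0.2))

# Rows with outcome 0 are left untouched.
all(df2$age[!idx] == df$age[!idx])
```

Either form modifies only the rows selected by idx; within mainly saves repeating the data frame name.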

Now, the data frames are different:

> tapply(df$age, df$outcome, mean)
       0        1 
60.46341 57.55556 
> tapply(df1$age, df$outcome, mean)
       0        1 
60.46341 72.55556 

> tapply(df$sex, df$outcome, summary)
$`0`
 f  m 
24 17 

$`1`
f m 
2 7 

> tapply(df1$sex, df$outcome, summary)
$`0`
 f  m 
24 17 

$`1`
f m 
1 8 
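Note that since R 4.0.0, data.frame no longer converts strings to factors by default, so summary on a character column reports length, class, and mode rather than counts per level. A cross-tabulation with table gives the counts for both character and factor columns. The sketch below uses small made-up vectors, not the simulated data above.

```r
# Toy vectors for illustration only (not the simulated data above).
sex     <- c("m", "f", "m", "m", "f", "m")
outcome <- factor(c(0, 0, 1, 1, 0, 1))

# Cross-tabulate sex by outcome; works for character and factor input.
counts <- table(sex = sex, outcome = outcome)
counts

# Column-wise proportions: share of each sex within an outcome group.
prop.table(counts, margin = 2)
```

By contrast, tapply with length, as used in the question, only counts the rows in each outcome group and cannot show a change in the sex distribution.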

Upvotes: 2
