Reputation: 1250
In the following simulated data set:
n <- 50
set.seed(378)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  smoker = sample(c("never", "former", "active"), n, replace = TRUE, prob = c(0.4, 0.45, 0.15)),
  py = abs(rnorm(n, 25, 10)),
  yrsquit = abs(rnorm(n, 10, 2)),
  outcome = as.factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)
I need to introduce some imbalance between the outcome groups (1 = disease, 0 = no disease). For example, subjects with the disease should be older and more likely to be male. I tried:
df1 <- within(df, sapply(length(outcome), function(x) {
if (outcome[x] == 1) {
age[x] <- age[x] + 15
sex[x] <- sample(c("m","f"), prob=c(0.8,0.2))
}
}))
but there is no difference as shown by
tapply(df$sex, df$outcome, length)
tapply(df1$sex, df$outcome, length)
tapply(df$age, df$outcome, mean)
tapply(df1$age, df$outcome, mean)
Upvotes: 0
Views: 762
Reputation: 81693
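The use of sapply inside within does not work as you expect. The function within evaluates the expression in an environment built from the data frame and keeps only the variables assigned at the top level of that expression. In your code, the assignments to age[x] and sex[x] happen inside the anonymous function passed to sapply, so they change local copies that are discarded when the function returns. In addition, sapply(length(outcome), ...) calls the function once with x = 50 rather than once per row; iterating over rows would need seq_along(outcome). Hence, within does not modify the data frame.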
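The scoping behaviour can be seen in a minimal sketch. The data frame toy and the names lost and kept are made up for illustration and are not part of the question's data:

```r
# Toy data frame (hypothetical, for illustration only)
toy <- data.frame(x = c(1, 2, 3))

# Assignment inside an anonymous function: only a local copy of x changes,
# so within() sees no modified variable
lost <- within(toy, sapply(seq_along(x), function(i) {
  x[i] <- x[i] + 10
}))

# Assignment at the top level of the expression: within() keeps it
kept <- within(toy, x <- x + 10)

print(lost$x)  # [1] 1 2 3
print(kept$x)  # [1] 11 12 13
```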
Here is an easier way to modify the data frame without a loop or sapply:
idx <- df$outcome == "1"
df1 <- within(df, {
  age[idx] <- age[idx] + 15
  sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))
})
Now, the data frames are different:
> tapply(df$age, df$outcome, mean)
0 1
60.46341 57.55556
> tapply(df1$age, df$outcome, mean)
0 1
60.46341 72.55556
> tapply(df$sex, df$outcome, summary)
$`0`
f m
24 17
$`1`
f m
2 7
> tapply(df1$sex, df$outcome, summary)
$`0`
f m
24 17
$`1`
f m
1 8
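Note that tapply(df$sex, df$outcome, length) from the question only counts rows per outcome group, so it cannot show a change in the sex distribution. A possible alternative check is a cross-tabulation with table. The sketch below is self-contained and makes some assumptions: it uses a reduced data frame (only age, sex and outcome), so its random draws will not match the numbers above, it sets stringsAsFactors = FALSE so that sex is a character column in any R version, and it uses plain indexed assignment instead of within:

```r
set.seed(378)
n <- 50

# Reduced version of the simulated data (assumption: other columns omitted)
df <- data.frame(
  age = sample(20:90, n, replace = TRUE),
  sex = sample(c("m", "f"), n, replace = TRUE, prob = c(0.55, 0.45)),
  outcome = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2))),
  stringsAsFactors = FALSE
)

# Rows belonging to the disease group
idx <- df$outcome == "1"

# Same modification as above, written as direct indexed assignment
df1 <- df
df1$age[idx] <- df1$age[idx] + 15
df1$sex[idx] <- sample(c("m", "f"), sum(idx), replace = TRUE, prob = c(0.8, 0.2))

# Sex by outcome, before and after the modification
print(table(sex = df$sex, outcome = df$outcome))
print(table(sex = df1$sex, outcome = df1$outcome))

# Change in mean age per outcome group
print(tapply(df1$age, df1$outcome, mean) - tapply(df$age, df$outcome, mean))
```

Because the new sex values are drawn at random, a single small sample may show only a modest shift toward "m", as in the output above.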
Upvotes: 2