Reputation: 325
I am trying to generate a fake dataset for testing.
It was easy enough to generate the columns that exist in all combinations:
subject <- 1:5
visit <- c("D0", "D100", "D500")
isotype <- c("IgG", "IgA", "IgM", "IgD")
testdata <- expand.grid(subject, visit, isotype)
names(testdata) <- c("subject", "visit", "isotype")
Now I need to create two more columns: "positivity", with a particular value for each group in "visit", and "response", with a random integer whose range depends on the group in "visit".
For "positivity", I could do it this way:
testdata[testdata$visit == "D0", c("positivity")] <- NA
testdata[testdata$visit == "D100", c("positivity")] <- 1
testdata[testdata$visit == "D500", c("positivity")] <- 0
and for "response", I could do it this way:
testdata[testdata$visit == "D0", c("response")] <- sample(1:100, 1)
testdata[testdata$visit == "D100", c("response")] <- sample(20000:30000, 1)
testdata[testdata$visit == "D500", c("response")] <- sample(1:100, 1)
but in reality "visit" has many more unique values than this, so writing one line per value would take forever. I was hoping I could use dplyr and group_by to loop through each group, assigning "positivity" from a vector (its length should equal the number of groups in "visit") and assigning "response" by passing a vector of ranges to the sample function.
positivityvalues <- c(NA, 1, 0)
responseranges <- c(1:100, 1:500, 1:100)
testdata <- testdata %>%
group_by(visit) %>%
mutate(#i can't figure out what to put here
#positivity[1] = positivityvalues[1] etc...
#response[1] = sample(responseranges[1], 1) etc...
)
to get something like this (for the sake of clarity, only the first two subjects and isotypes are listed)
subject  visit  isotype  positivity  response
1        D0     IgG      NA          58
1        D100   IgG      1           27093
1        D500   IgG      0           2
1        D0     IgA      NA          42
1        D100   IgA      1           28921
1        D500   IgA      0           85
2        D0     IgG      NA          86
2        D100   IgG      1           26039
2        D500   IgG      0           54
2        D0     IgA      NA          99
2        D100   IgA      1           29021
2        D500   IgA      0           23
Thanks
Edit* finished updates
Edit2* Solution:
ranges <- list(D0=c(1:100), D100=c(25000:32000), D500=c(1:100))
positives <- c(D0=NA, D100=1, D500=0)
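# index by character so the lookup matches on names, not on factor codes
testdata$positivity <- positives[as.character(testdata$visit)]
testdata$responsetemp <- ranges[as.character(testdata$visit)]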
testdata$response <- sapply(testdata$responsetemp, function(x) sample(x, 1))
Upvotes: 0
Views: 450
Reputation: 887068
Here is an option using tidyverse. Create a named vector with the unique values of 'visit' (it is not clear how the values will be changed when there are more unique elements in 'visit'). Use it to match the 'visit' elements and replace them with the NA, 1, 0 values of the named vector, then split the data by 'visit' and use map2 to sample from the range of the corresponding vector.
library(tidyverse)
v1 <- setNames(c(NA, 1, 0), as.character(unique(testdata$visit)))
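testdata %>%
  # index by character so the lookup uses the names, not the factor codes
  mutate(positivity = v1[as.character(visit)]) %>%
  split(.$visit) %>%
  # pair each visit's rows with its range and sample one value per row
  map2_df(., list(1:100, 20000:30000, 1:100), ~
    .x %>%
      mutate(response = sample(.y, n())))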
Upvotes: 1
Reputation: 18425
You can do this with a named vector...
testdata <- expand.grid(subject=subject, visit=visit, isotype=isotype)
#this way to get column names
positivityvalues <- c(D0=NA, D100=1, D500=0) #add names
testdata$positivity <- positivityvalues[testdata$visit] #adds value by name
You could do something similar with the parameters for the sample function in the response column.
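That suggestion could be sketched roughly as follows in base R. This is only an illustration: the named list responseranges and the helper variable visitkey are names made up here, and the D100 range is borrowed from the question. One value is drawn per row, so rows that share a visit still get different numbers.

```r
# Rebuild the example data from the question
subject <- 1:5
visit <- c("D0", "D100", "D500")
isotype <- c("IgG", "IgA", "IgM", "IgD")
testdata <- expand.grid(subject = subject, visit = visit, isotype = isotype)

# Hypothetical lookup (not from the answer): one integer range per visit
responseranges <- list(D0 = 1:100, D100 = 20000:30000, D500 = 1:100)

# Index by character, not by factor, so the lookup matches on names
# rather than on the factor's underlying integer codes
visitkey <- as.character(testdata$visit)

# Draw one value per row from the range that belongs to that row's visit
testdata$response <- vapply(
  responseranges[visitkey],
  function(r) r[sample.int(length(r), 1)],
  integer(1),
  USE.NAMES = FALSE
)
```

Sampling an index with sample.int, rather than calling sample on the range itself, sidesteps sample's habit of treating a single number n as 1:n, should any range ever contain just one value.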
Upvotes: 1