Reputation: 415
I have the following data frame, with data from 1000 people on sex, three repeated height measures and the age at each measure.
data <- data.frame(
child_id = 1:1000,
sex = rbinom(n = 1000, size = 1, prob = 0.5),
height_5 = rnorm(1000, mean = 80, sd = 5),
height_6 = rnorm(1000, mean = 90, sd = 5),
height_7 = rnorm(1000, mean = 100, sd = 5),
age_5 = rnorm(1000, mean = 5.2, sd = 1.5),
age_6 = rnorm(1000, mean = 6.1, sd = 1.5),
age_7 = rnorm(1000, mean = 7.3, sd = 1.5)
)
data$sex <- factor(data$sex,
levels = c(0,1),
labels = c("Male", "Female"))
### Generate SOME MISSING VALUES -----
data$height_5[which(data$height_5 %in% sample(data$height_5, 25))] <- NA
data$height_6[which(data$height_6 %in% sample(data$height_6, 25))] <- NA
data$height_7[which(data$height_7 %in% sample(data$height_7, 25))] <- NA
I can generate zscores at each measurement as follows
data$ht5z <- scale(data$height_5, center = TRUE, scale = TRUE)
data$ht6z <- scale(data$height_6, center = TRUE, scale = TRUE)
data$ht7z <- scale(data$height_7, center = TRUE, scale = TRUE)
How can i generate these for each sex and year e.g. htzm3 if sex = male and age >=3 and <4, htzm4 if sex = male and age >=4 and <5 etc.
Upvotes: 0
Views: 476
Reputation: 8886
How about this:
library(dplyr)
library(stringr)
library(tidyr)
data %>%
gather(key, value, age_5, age_6, age_7, height_5, height_6, height_7) %>%
separate(key, c("key", "obs_time"), "_") %>%
spread(key, value) %>%
mutate(whole_age = floor(age)) %>%
group_by(sex, whole_age) %>%
mutate(htz = scale(height),
sex_init = str_to_lower(str_extract(sex, "^.")),
sa = paste0("htz", sex_init, whole_age)) %>%
ungroup() %>%
spread(sa, htz)
First we would like to put the data in a tidy format.
To do so we first gather up all of your age and height columns into just two columns: key
and value
. key
then takes on the name of the original variable as values, value
takes on the value under the corresponding variable, and other variables are copied down as is. The data now look like this:
# A tibble: 6,000 x 4
child_id sex key value
<int> <fct> <chr> <dbl>
1 1 Male age_5 5.67
2 1 Male age_6 7.02
3 1 Male age_7 8.86
4 1 Male height_5 79.2
5 1 Male height_6 95.8
6 1 Male height_7 85.0
7 2 Male age_5 3.38
8 2 Male age_6 5.06
9 2 Male age_7 5.47
10 2 Male height_5 79.2
# ... with 5,990 more rows
Second, we separate the key
column into two columns: key
and obs_time
using the "_" as the delimiter. The data now look like:
# A tibble: 6,000 x 5
child_id sex key obs_time value
<int> <fct> <chr> <chr> <dbl>
1 1 Male age 5 5.67
2 1 Male age 6 7.02
3 1 Male age 7 8.86
4 1 Male height 5 79.2
5 1 Male height 6 95.8
6 1 Male height 7 85.0
7 2 Male age 5 3.38
8 2 Male age 6 5.06
9 2 Male age 7 5.47
10 2 Male height 5 79.2
# ... with 5,990 more rows
Third, we spread the values up into two variables: age
and height
. The data now look like:
# A tibble: 3,000 x 5
child_id sex obs_time age height
<int> <fct> <chr> <dbl> <dbl>
1 1 Male 5 5.67 79.2
2 1 Male 6 7.02 95.8
3 1 Male 7 8.86 85.0
4 2 Male 5 3.38 79.2
5 2 Male 6 5.06 81.8
6 2 Male 7 5.47 102.
7 3 Male 5 5.04 80.4
8 3 Male 6 6.37 95.3
9 3 Male 7 7.01 97.4
10 4 Male 5 6.25 90.8
# ... with 2,990 more rows
Fourth, through seventh, we mutate the age category whole_age
and then group by sex
and whole_age
so that when we scale it will be applied separately for each of those groups. Then we do the scaling in each group, extract the first initial of sex
and construct variable names corresponding to the freshly scaled values all in one column called sa
. We can then remove the grouping. The data now look like:
# A tibble: 3,000 x 9
child_id sex obs_time age height whole_age htz sex_init sa
<int> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 1 Male 5 5.67 79.2 5 -0.967 m htzm5
2 1 Male 6 7.02 95.8 7 0.345 m htzm7
3 1 Male 7 8.86 85.0 8 -1.20 m htzm8
4 2 Male 5 3.38 79.2 3 -0.580 m htzm3
5 2 Male 6 5.06 81.8 5 -0.681 m htzm5
6 2 Male 7 5.47 102. 5 1.55 m htzm5
7 3 Male 5 5.04 80.4 5 -0.829 m htzm5
8 3 Male 6 6.37 95.3 6 0.455 m htzm6
9 3 Male 7 7.01 97.4 7 0.529 m htzm7
10 4 Male 5 6.25 90.8 6 -0.0366 m htzm6
# ... with 2,990 more rows
Finally, we can spread the data into the variables you requested. And now we have:
# A tibble: 3,000 x 32
child_id sex obs_time age height whole_age sex_init htzf0 htzf1 htzf10 htzf11 htzf2 htzf3
<int> <fct> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 Male 5 5.67 79.2 5 m NA NA NA NA NA NA
2 1 Male 6 7.02 95.8 7 m NA NA NA NA NA NA
3 1 Male 7 8.86 85.0 8 m NA NA NA NA NA NA
4 2 Male 5 3.38 79.2 3 m NA NA NA NA NA NA
5 2 Male 6 5.06 81.8 5 m NA NA NA NA NA NA
6 2 Male 7 5.47 102. 5 m NA NA NA NA NA NA
7 3 Male 5 5.04 80.4 5 m NA NA NA NA NA NA
8 3 Male 6 6.37 95.3 6 m NA NA NA NA NA NA
9 3 Male 7 7.01 97.4 7 m NA NA NA NA NA NA
10 4 Male 5 6.25 90.8 6 m NA NA NA NA NA NA
# ... with 2,990 more rows, and 19 more variables: htzf4 <dbl>, htzf5 <dbl>, htzf6 <dbl>,
# htzf7 <dbl>, htzf8 <dbl>, htzf9 <dbl>, htzm0 <dbl>, htzm1 <dbl>, htzm10 <dbl>, htzm11 <dbl>,
# htzm12 <dbl>, htzm2 <dbl>, htzm3 <dbl>, htzm4 <dbl>, htzm5 <dbl>, htzm6 <dbl>, htzm7 <dbl>,
# htzm8 <dbl>, htzm9 <dbl>
Upvotes: 1