Reputation: 57
I have a large dataframe with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each alphabet represents a score and I would like to calculate the difference between the follow-up and baseline for each score in a new column:
subid | a_score.baseline | a_score.followup | b_score.baseline | b_score.followup | c_score.baseline | c_score.followup | |
---|---|---|---|---|---|---|---|
1 | 100 | 150 | 5 | 2 | 80 | 70 | |
2 | 120 | 142 | 10 | 9 | 79 | 42 | |
3 | 111 | 146 | 60 | 49 | 89 | 46 | |
4 | 152 | 148 | 4 | 4 | 69 | 48 | |
5 | 110 | 123 | 20 | 18 | 60 | 23 | |
6 | 112 | 120 | 5 | 3 | 12 | 20 | |
7 | 111 | 145 | 6 | 4 | 11 | 45 |
I'd like to calculate the difference between followup and baseline for each score in a new column like this:
df$a_score_difference = df$a_score.followup - df$a_score.baseleine
Any ideas on how to do this efficiently? I really appreciate your help.
code to generate sample data:
subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)
df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)
Upvotes: 0
Views: 683
Reputation: 160407
scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13 8 34
# $b_score.followup
# [1] -3 -1 -11 0 -2 -2 -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37 8 34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1 1 100 150 5 2 80 70 50
# 2 2 120 142 10 9 79 42 22
# 3 3 111 146 60 49 89 46 35
# 4 4 152 148 4 4 69 48 -4
# 5 5 110 123 20 18 60 23 13
# 6 6 112 120 5 3 12 20 8
# 7 7 111 145 6 4 11 45 34
# b_score.difference c_score.difference
# 1 -3 -10
# 2 -1 -37
# 3 -11 -43
# 4 0 -21
# 5 -2 -37
# 6 -2 8
# 7 -2 34
There exists (in an unsupervised mode) the possibility that not all followup
s will have comparable baseline
s, which could cause a problem. You might include a test to validate the presence and order:
all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE
You might consider pivoting the data into a more long format. This can be done in base R as well, but looks a lot simpler when done here:
library(dplyr)
# library(tidyr) # pivot_*
df %>%
tidyr::pivot_longer(
-subid,
names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline)
# # A tibble: 21 x 5
# subid ltr baseline followup difference
# <int> <chr> <dbl> <dbl> <dbl>
# 1 1 a 100 150 50
# 2 1 b 5 2 -3
# 3 1 c 80 70 -10
# 4 2 a 120 142 22
# 5 2 b 10 9 -1
# 6 2 c 79 42 -37
# 7 3 a 111 146 35
# 8 3 b 60 49 -11
# 9 3 c 89 46 -43
# 10 4 a 152 148 -4
# # ... with 11 more rows
Honestly, I tend to prefer a long format most of the time for many reasons. If, however, you want to make it wide again, then
df %>%
tidyr::pivot_longer(
-subid, names_pattern = "(.*)_score.(.*)",
names_to = c("ltr", ".value")) %>%
mutate(difference = followup - baseline) %>%
tidyr::pivot_wider(
names_from = "ltr",
values_from = c("baseline", "followup", "difference"),
names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
# subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 100 5 80 150 2 70 50 -3 -10
# 2 2 120 10 79 142 9 42 22 -1 -37
# 3 3 111 60 89 146 49 46 35 -11 -43
# 4 4 152 4 69 148 4 48 -4 0 -21
# 5 5 110 20 60 123 18 23 13 -2 -37
# 6 6 112 5 12 120 3 20 8 -2 8
# 7 7 111 6 11 145 4 45 34 -2 34
This is a keep-it-wide (no pivoting), which will be more efficient than the pivot-mutate-pivot above if you have no intention of working on it in a longer format.
df %>%
mutate(across(
ends_with("score.followup"),
~ . - cur_data()[[sub("followup", "baseline", cur_column())]],
.names = "{sub('followup', 'difference', col)}")
)
# subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1 1 100 150 5 2 80 70 50 -3 -10
# 2 2 120 142 10 9 79 42 22 -1 -37
# 3 3 111 146 60 49 89 46 35 -11 -43
# 4 4 152 148 4 4 69 48 -4 0 -21
# 5 5 110 123 20 18 60 23 13 -2 -37
# 6 6 112 120 5 3 12 20 8 -2 8
# 7 7 111 145 6 4 11 45 34 -2 34
Upvotes: 3