Reputation: 57

Calculating the difference between two columns based on similar names in R

I have a large dataframe with 400 columns of baseline and follow-up scores (and 10,000 subjects). Each alphabet represents a score and I would like to calculate the difference between the follow-up and baseline for each score in a new column:

subid	a_score.baseline	a_score.followup	b_score.baseline	b_score.followup	c_score.baseline	c_score.followup
1	100	150	5	2	80	70
2	120	142	10	9	79	42
3	111	146	60	49	89	46
4	152	148	4	4	69	48
5	110	123	20	18	60	23
6	112	120	5	3	12	20
7	111	145	6	4	11	45

I'd like to calculate the difference between followup and baseline for each score in a new column like this:

df$a_score_difference = df$a_score.followup - df$a_score.baseleine

Any ideas on how to do this efficiently? I really appreciate your help.

code to generate sample data:

subid <- c(1:7)
a_score.baseline <- c(100,120,111,152,110,112,111)
a_score.followup <- c(150,142,146,148,123,120,145)
b_score.baseline <- c(5,10,60,4,20,5,6)
b_score.followup <- c(2,9,49,4,18,3,4)
c_score.baseline <- c(80,79,89,69,60,12,11)
c_score.followup <- c(70,42,46,48,23,20,45)

df <- data.frame(subid,a_score.baseline,a_score.followup,b_score.baseline,b_score.followup,c_score.baseline,c_score.followup)

Upvotes: 0

Answers (1)

r2evans

Reputation: 160952

base R

scores <- sort(grep("score\\.(baseline|followup)", names(df), value = TRUE))
scores
# [1] "a_score.baseline" "a_score.followup" "b_score.baseline" "b_score.followup" "c_score.baseline" "c_score.followup"
scores <- split(scores, sub(".*_", "", scores))
scores
# $score.baseline
# [1] "a_score.baseline" "b_score.baseline" "c_score.baseline"
# $score.followup
# [1] "a_score.followup" "b_score.followup" "c_score.followup"
Map(`-`, df[scores[[2]]], df[scores[[1]]])
# $a_score.followup
# [1] 50 22 35 -4 13  8 34
# $b_score.followup
# [1]  -3  -1 -11   0  -2  -2  -2
# $c_score.followup
# [1] -10 -37 -43 -21 -37   8  34
out <- Map(`-`, df[scores[[2]]], df[scores[[1]]])
names(out) <- sub("followup", "difference", names(out))
df <- cbind(df, out)
df
#   subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference
# 1     1              100              150                5                2               80               70                 50
# 2     2              120              142               10                9               79               42                 22
# 3     3              111              146               60               49               89               46                 35
# 4     4              152              148                4                4               69               48                 -4
# 5     5              110              123               20               18               60               23                 13
# 6     6              112              120                5                3               12               20                  8
# 7     7              111              145                6                4               11               45                 34
#   b_score.difference c_score.difference
# 1                 -3                -10
# 2                 -1                -37
# 3                -11                -43
# 4                  0                -21
# 5                 -2                -37
# 6                 -2                  8
# 7                 -2                 34

There exists (in an unsupervised mode) the possibility that not all followups will have comparable baselines, which could cause a problem. You might include a test to validate the presence and order:

all(sub("baseline", "followup", scores$score.baseline) == scores$score.followup)
# [1] TRUE

dplyr

You might consider pivoting the data into a more long format. This can be done in base R as well, but looks a lot simpler when done here:

library(dplyr)
# library(tidyr) # pivot_*
df %>%
  tidyr::pivot_longer(
    -subid,
    names_pattern = "(.*)_score.(.*)", 
    names_to = c("ltr", ".value")) %>%
  mutate(difference = followup - baseline)
# # A tibble: 21 x 5
#    subid ltr   baseline followup difference
#    <int> <chr>    <dbl>    <dbl>      <dbl>
#  1     1 a          100      150         50
#  2     1 b            5        2         -3
#  3     1 c           80       70        -10
#  4     2 a          120      142         22
#  5     2 b           10        9         -1
#  6     2 c           79       42        -37
#  7     3 a          111      146         35
#  8     3 b           60       49        -11
#  9     3 c           89       46        -43
# 10     4 a          152      148         -4
# # ... with 11 more rows

Honestly, I tend to prefer a long format most of the time for many reasons. If, however, you want to make it wide again, then

df %>%
  tidyr::pivot_longer(
    -subid, names_pattern = "(.*)_score.(.*)", 
    names_to = c("ltr", ".value")) %>%
  mutate(difference = followup - baseline) %>%
  tidyr::pivot_wider(
    names_from = "ltr", 
    values_from = c("baseline", "followup", "difference"), 
    names_glue = "{ltr}_score.{.value}")
# # A tibble: 7 x 10
#   subid a_score.baseline b_score.baseline c_score.baseline a_score.followup b_score.followup c_score.followup a_score.difference b_score.difference c_score.difference
#   <int>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>            <dbl>              <dbl>              <dbl>              <dbl>
# 1     1              100                5               80              150                2               70                 50                 -3                -10
# 2     2              120               10               79              142                9               42                 22                 -1                -37
# 3     3              111               60               89              146               49               46                 35                -11                -43
# 4     4              152                4               69              148                4               48                 -4                  0                -21
# 5     5              110               20               60              123               18               23                 13                 -2                -37
# 6     6              112                5               12              120                3               20                  8                 -2                  8
# 7     7              111                6               11              145                4               45                 34                 -2                 34

dplyr #2

This is a keep-it-wide (no pivoting), which will be more efficient than the pivot-mutate-pivot above if you have no intention of working on it in a longer format.

df %>%
  mutate(across(
    ends_with("score.followup"),
    ~ . - cur_data()[[sub("followup", "baseline", cur_column())]], 
    .names = "{sub('followup', 'difference', col)}")
  )
#   subid a_score.baseline a_score.followup b_score.baseline b_score.followup c_score.baseline c_score.followup a_score.difference b_score.difference c_score.difference
# 1     1              100              150                5                2               80               70                 50                 -3                -10
# 2     2              120              142               10                9               79               42                 22                 -1                -37
# 3     3              111              146               60               49               89               46                 35                -11                -43
# 4     4              152              148                4                4               69               48                 -4                  0                -21
# 5     5              110              123               20               18               60               23                 13                 -2                -37
# 6     6              112              120                5                3               12               20                  8                 -2                  8
# 7     7              111              145                6                4               11               45                 34                 -2                 34

Upvotes: 3

subid	a_score.baseline	a_score.followup	b_score.baseline	b_score.followup	c_score.baseline	c_score.followup
1	100	150	5	2	80	70
2	120	142	10	9	79	42
3	111	146	60	49	89	46
4	152	148	4	4	69	48
5	110	123	20	18	60	23
6	112	120	5	3	12	20
7	111	145	6	4	11	45

subid	a_score.baseline	a_score.followup	b_score.baseline	b_score.followup	c_score.baseline	c_score.followup
1	100	150	5	2	80	70
2	120	142	10	9	79	42
3	111	146	60	49	89	46
4	152	148	4	4	69	48
5	110	123	20	18	60	23
6	112	120	5	3	12	20
7	111	145	6	4	11	45

Calculating the difference between two columns based on similar names in R

Answers (1)

base R

dplyr

dplyr #2

Related Questions

subid	a_score.baseline	a_score.followup	b_score.baseline	b_score.followup	c_score.baseline	c_score.followup
1	100	150	5	2	80	70
2	120	142	10	9	79	42
3	111	146	60	49	89	46
4	152	148	4	4	69	48
5	110	123	20	18	60	23
6	112	120	5	3	12	20
7	111	145	6	4	11	45