Reputation: 41

Tidyverse column-wise differences

Suppose I have a data frame like this:

df = data.frame(preA = c(1,2,3),preB = c(3,4,5),postA = c(6,7,8),postB = c(9,8,4))

I want to add columns having column-wise differences, that is:

diffA = postA - preA
diffB = postB - preB

and so on...
Is there an efficient way to do this in tidyverse?

Upvotes: 4

Answers (3)

wurli

Reputation: 2748

The best approach is to first get your data into a form where columns identify variables and cells to values. Here you've got a variable with values "A" and "B" which does not have its own column, and the result is that this fairly simple operation is quite hard to reason about.

Method 1: use pivoting:

Here's how you can get your data into this form:

library(dplyr)
library(tidyr)

df <- tibble(
  preA = c(1, 2, 3), 
  preB = c(3, 4, 5), 
  postA = c(6, 7, 8),  
  postB = c(9, 8, 4)
)

df_transformed <- df |>
  pivot_longer(
    everything(),
    names_pattern = "(pre|post)(A|B)",
    names_to = c(".value", "category")
  )

df_transformed
#> # A tibble: 6 × 3
#>   category   pre  post
#>   <chr>    <dbl> <dbl>
#> 1 A            1     6
#> 2 B            3     9
#> 3 A            2     7
#> 4 B            4     8
#> 5 A            3     8
#> 6 B            5     4

^{Created on 2024-04-17 with reprex v2.1.0}

With your data in this form the calculation is trivial:

df2 <- df_transformed |>
  mutate(diff = post - pre)

If you really need to get the data back in the original form (which to be honest I don't advise if it can be avoided) you can do so like this:

df2 |> 
  mutate(id = row_number(), .by = category) |>
  pivot_wider(
    names_from = category,
    values_from = c(pre, post, diff),
    names_glue = "{.value}{category}"
  )
#> # A tibble: 3 × 7
#>      id  preA  preB postA postB diffA diffB
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     1     3     6     9     5     6
#> 2     2     2     4     7     8     5     4
#> 3     3     3     5     8     4     5    -1

A few more solutions if you don't like pivoting:

Method 2: Using expressions:

I think this option is fairly readable, and I think it should be reasonably fast compared to solutions using across():

cols         <- c("A", "B")
exprs        <- glue("post{cols} - pre{cols}")
names(exprs) <- glue("diff{cols}")

df |> 
  mutate(!!!rlang::parse_exprs(exprs))
#>   preA preB postA postB diffA diffB
#> 1    1    3     6     9     5     6
#> 2    2    4     7     8     5     4
#> 3    3    5     8     4     5    -1

Method 3: Using `mutate()` + `across()` + `get()`:

Personally, I don't like this sort of thing because I think it's really hard to read:

df |> 
  mutate(across(
    starts_with("post"),
    ~ .x - get(stringr::str_replace_all(cur_column(), "^post", "pre")),
    .names = "diff{stringr::str_remove(.col, '^post')}"
  ))
#>   preA preB postA postB diffA diffB
#> 1    1    3     6     9     5     6
#> 2    2    4     7     8     5     4
#> 3    3    5     8     4     5    -1

Method 4: Using base subsetting:

The main advantage here is that you don't need any packages (you can use paste0() instead of glue()), IMO it's also pretty readable. But I don't like that it doesn't play well with |>:

cols <- c("A", "B")
df2  <- df
df2[glue("diff{cols}")] <- df2[glue("post{cols}")] - df2[glue("pre{cols}")]

df2
#>   preA preB postA postB diffA diffB
#> 1    1    3     6     9     5     6
#> 2    2    4     7     8     5     4
#> 3    3    5     8     4     5    -1

Upvotes: 3

lroha

Reputation: 34511

You can do this with two uses of across(), creating new variables with the first use and subtracting the second. This also assumes your columns are in order.

df %>%
  mutate(across(starts_with("post"), .names = "diff{sub('post', '', .col)}") - across(starts_with("pre")))

  preA preB postA postB diffA diffB
1    1    3     6     9     5     6
2    2    4     7     8     5     4
3    3    5     8     4     5    -1

Upvotes: 3

Maël

Reputation: 52069

The way to go with dplyr and tidyr:

library(dplyr)
library(tidyr)
df %>% 
  mutate(id = 1:n()) %>% 
  pivot_longer(-id,
               names_to = c("pre_post", ".value"),
               names_pattern = "(pre|post)(.*)") %>% 
  group_by(id) %>% 
  mutate(across(A:B, diff, .names = "diff{col}")) %>% 
  pivot_wider(names_from = pre_post, values_from = c(A, B),
              names_glue = '{pre_post}{.value}') %>% 
  select(id, starts_with("pre"), starts_with("post"), starts_with("diff"))

#      id  preA  preB postA postB diffA diffB
# 1     1     1     3     6     9     5     6
# 2     2     2     4     7     8     5     4
# 3     3     3     5     8     4     5    -1

A shorter but less flexible was with dplyover::across2:

library(dplyr)
library(dplover)
df %>% 
  #relocate(sort(colnames(.))) %>% 
  mutate(across2(starts_with("post"), starts_with("pre"), `-`,
                 .names = "diff{idx}"))

#      preA  preB postA postB diff1 diff2
# 1       1     3     6     9     5     6
# 2       2     4     7     8     5     4
# 3       3     5     8     4     5    -1

Upvotes: 2