Reputation: 41
Suppose I have a data frame like this:
df = data.frame(preA = c(1,2,3),preB = c(3,4,5),postA = c(6,7,8),postB = c(9,8,4))
I want to add columns having column-wise differences, that is:
diffA = postA - preA
diffB = postB - preB
and so on...
Is there an efficient way to do this in tidyverse?
Upvotes: 4
Views: 449
Reputation: 2748
The best approach is to first get your data into a form where columns identify variables and cells to values. Here you've got a variable with values "A"
and "B"
which does not have its own column, and the result is that this fairly simple operation is quite hard to reason about.
Here's how you can get your data into this form:
library(dplyr)
library(tidyr)
df <- tibble(
preA = c(1, 2, 3),
preB = c(3, 4, 5),
postA = c(6, 7, 8),
postB = c(9, 8, 4)
)
df_transformed <- df |>
pivot_longer(
everything(),
names_pattern = "(pre|post)(A|B)",
names_to = c(".value", "category")
)
df_transformed
#> # A tibble: 6 × 3
#> category pre post
#> <chr> <dbl> <dbl>
#> 1 A 1 6
#> 2 B 3 9
#> 3 A 2 7
#> 4 B 4 8
#> 5 A 3 8
#> 6 B 5 4
Created on 2024-04-17 with reprex v2.1.0
With your data in this form the calculation is trivial:
df2 <- df_transformed |>
mutate(diff = post - pre)
If you really need to get the data back in the original form (which to be honest I don't advise if it can be avoided) you can do so like this:
df2 |>
mutate(id = row_number(), .by = category) |>
pivot_wider(
names_from = category,
values_from = c(pre, post, diff),
names_glue = "{.value}{category}"
)
#> # A tibble: 3 × 7
#> id preA preB postA postB diffA diffB
#> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 3 6 9 5 6
#> 2 2 2 4 7 8 5 4
#> 3 3 3 5 8 4 5 -1
A few more solutions if you don't like pivoting:
I think this option is fairly readable, and I think it should be reasonably fast compared to solutions using across()
:
cols <- c("A", "B")
exprs <- glue("post{cols} - pre{cols}")
names(exprs) <- glue("diff{cols}")
df |>
mutate(!!!rlang::parse_exprs(exprs))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
mutate()
+ across()
+ get()
:Personally, I don't like this sort of thing because I think it's really hard to read:
df |>
mutate(across(
starts_with("post"),
~ .x - get(stringr::str_replace_all(cur_column(), "^post", "pre")),
.names = "diff{stringr::str_remove(.col, '^post')}"
))
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
The main advantage here is that you don't need any packages (you can use paste0()
instead of glue()
), IMO it's also pretty readable. But I don't like that it doesn't play well with |>
:
cols <- c("A", "B")
df2 <- df
df2[glue("diff{cols}")] <- df2[glue("post{cols}")] - df2[glue("pre{cols}")]
df2
#> preA preB postA postB diffA diffB
#> 1 1 3 6 9 5 6
#> 2 2 4 7 8 5 4
#> 3 3 5 8 4 5 -1
Upvotes: 3
Reputation: 34511
You can do this with two uses of across()
, creating new variables with the first use and subtracting the second. This also assumes your columns are in order.
df %>%
mutate(across(starts_with("post"), .names = "diff{sub('post', '', .col)}") - across(starts_with("pre")))
preA preB postA postB diffA diffB
1 1 3 6 9 5 6
2 2 4 7 8 5 4
3 3 5 8 4 5 -1
Upvotes: 3
Reputation: 52069
The way to go with dplyr
and tidyr
:
library(dplyr)
library(tidyr)
df %>%
mutate(id = 1:n()) %>%
pivot_longer(-id,
names_to = c("pre_post", ".value"),
names_pattern = "(pre|post)(.*)") %>%
group_by(id) %>%
mutate(across(A:B, diff, .names = "diff{col}")) %>%
pivot_wider(names_from = pre_post, values_from = c(A, B),
names_glue = '{pre_post}{.value}') %>%
select(id, starts_with("pre"), starts_with("post"), starts_with("diff"))
# id preA preB postA postB diffA diffB
# 1 1 1 3 6 9 5 6
# 2 2 2 4 7 8 5 4
# 3 3 3 5 8 4 5 -1
A shorter but less flexible was with dplyover::across2
:
library(dplyr)
library(dplover)
df %>%
#relocate(sort(colnames(.))) %>%
mutate(across2(starts_with("post"), starts_with("pre"), `-`,
.names = "diff{idx}"))
# preA preB postA postB diff1 diff2
# 1 1 3 6 9 5 6
# 2 2 4 7 8 5 4
# 3 3 5 8 4 5 -1
Upvotes: 2