Reputation: 148
I would like to compare variable pairs across the entire data frame and create a flag variable that shows if they have the same value or not. Within my real dataset I have hundreds of these variable pairs.
The pairs have a similar name structure e.g. var and sum_var and an example in the iris dataset is Sepal.Length and sum_Sepal.Length
The end result should be extra variables named var_flag e.g. Sepal.Length_flag for each of the pairs with values 1 for value match (all Petal flags) otherwise 0 (all Sepal flags) for no match.
Any help shall be appreciated, especially using tidyverse. Thanks.
Data:
library(tidyverse)
local_iris <- iris %>%
mutate(
across(all_of(starts_with("Sepal")), ~ sum(.x, na.rm = TRUE), .names = "sum_{.col}"),
across(all_of(starts_with("Petal")), ~ .x + 0, .names = "sum_{.col}")
)
Upvotes: 0
Views: 55
Reputation: 18702
local_iris |>
mutate(across(starts_with("sum_"),
\(x) +(x == pick(all_of(str_sub(cur_column(), 5)))[[1]]),
.names = "{.col}_flag"))
How it works
"sum_"
. That gives the most flexibility so you don't have to hardcode all the prefixes (e.g. "Sepal", "Petal", etc.).cur_column()
gives you the string name of the current column you are iterating over in across()
. We use that to keep everything from the fifth character to the end. This effectively removes sum_
leaving just "Sepal.Length"
for example.all_of()
takes a vector of strings to look up a column name in the tidyverse. It is strict so it will error if it cannot find that column name (as opposed to any_of()
).pick()
and then extract ([[
) that one column into a vector for comparison ==
. This comparison returns a logical vector that we convert into numeric with +()
..names
argument of across()
.Output
Only showing the structure of the first few rows:
'data.frame': 6 obs. of 13 variables:
$ Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9
$ Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1
$ sum_Sepal.Length : num 876 876 876 876 876 ...
$ sum_Sepal.Width : num 459 459 459 459 459 ...
$ sum_Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7
$ sum_Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4
$ Sepal.Length_flag: int 0 0 0 0 0 0
$ Sepal.Width_flag : int 0 0 0 0 0 0
$ Petal.Length_flag: int 1 1 1 1 1 1
$ Petal.Width_flag : int 1 1 1 1 1 1
Upvotes: 2
Reputation: 17544
There are probably more compact approaches, but here's one for a starter:
library(tidyverse)
local_iris <- iris %>%
mutate(
across(all_of(starts_with("Sepal")), ~ sum(.x, na.rm = TRUE), .names = "sum_{.col}"),
across(all_of(starts_with("Petal")), ~ .x + 0, .names = "sum_{.col}")
)
# variables to test, all that start with "sum_")
vars <-
names(local_iris) %>%
str_subset("^sum_") %>%
str_remove("^sum_")
# example data used floats,
# so let's test for near equality in 2 col frame, same tolerance as for all.equal()
# +() turns boolean to numeric
col_eq <- function(df_, tolerance = sqrt(.Machine$double.eps)) {
+(abs(df_[[1]] - df_[[2]]) <= tolerance)
}
vars %>%
# prepare flag colum names
set_names(str_c(.,"_flag")) %>%
# test each pair for near equality
map(\(x) select(local_iris, ends_with(x)) %>% col_eq()) %>%
# bind flag vectors and local_iris
bind_cols(local_iris) %>%
glimpse()
Result:
#> Rows: 150
#> Columns: 13
#> $ Sepal.Length_flag <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ Sepal.Width_flag <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ Petal.Length_flag <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Petal.Width_flag <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.…
#> $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.…
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.…
#> $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.…
#> $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, seto…
#> $ sum_Sepal.Length <dbl> 876.5, 876.5, 876.5, 876.5, 876.5, 876.5, 876.5, 876…
#> $ sum_Sepal.Width <dbl> 458.6, 458.6, 458.6, 458.6, 458.6, 458.6, 458.6, 458…
#> $ sum_Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.…
#> $ sum_Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.…
Created on 2024-06-07 with reprex v2.1.0
Upvotes: 2