JontroPothon
JontroPothon

Reputation: 600

Creating a column containing variable names of a data frame based on values in the columns

I have this data frame,

id <- seq(1:4)
var_1 <- c(TRUE, FALSE, FALSE, FALSE)
var_2 <- c(FALSE, TRUE, FALSE, FALSE)
var_3 <- c(FALSE, TRUE, FALSE, FALSE)
var_4 <- c(FALSE, FALSE, TRUE, FALSE)
var_5 <- c(FALSE, TRUE, TRUE, FALSE)
df <- data.frame(id, var_1, var_2, var_3, var_4, var_5)
df
> df
  id var_1 var_2 var_3 var_4 var_5
1  1  TRUE FALSE FALSE FALSE FALSE
2  2 FALSE  TRUE  TRUE FALSE  TRUE
3  3 FALSE FALSE FALSE  TRUE  TRUE
4  4 FALSE FALSE FALSE FALSE FALSE

Now, I want to look on every row and see if any of the columns have TRUE values. If yes, I want to record it in a new column result. For the first row, I have TRUE in only var_1. So I record, the variable name where TRUE found, i.e. var_1. For second row, I have TRUE in var_2, var_3 and var_5. The entry in the result column should be var_2, var_3, var_5. Lastly, row 4 has no TRUE, so it should be NA. The resulting data.frame should look like this:

 id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE                <NA>

I can do this,

library(tidyverse)
df |> 
  mutate(across(where(is.logical),
                ~ case_when(.x ~ cur_column()), .names = "{.col}")) |> 
  unite("result", var_1:var_5, sep = ", ", na.rm = TRUE)

  id              result
1  1               var_1
2  2 var_2, var_3, var_5
3  3        var_4, var_5
4  4  

However, the problem is the number of variables should not be fixed to 5. It should be scalable to any variable number. Any idea how to achive that?

Upvotes: 5

Views: 128

Answers (5)

G. Grothendieck
G. Grothendieck

Reputation: 270045

1) dplyr Define a function toNameString that takes a one-row data frame (or named vector) of logical columns (or components) and returns a comma and space separated character string of the column names corresponding to TRUE. Then apply it to the non-id columns in each row.

(If you need to apply this to a data frame that has other columns as well then change pick(everything()) appropriately. For example, pick(where(is.logical)) would pick out only the logical columns or pick(starts_with("var_")) would pick out only the columns whose name starts with var_ .)

library(dplyr)

toNameString <- function(x) toString(names(x)[unlist(x)])
df |> mutate(result = toNameString(pick(everything())), .by = id)

giving

  id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE                    

2) Base R Using only base R and toNameString from above apply it to each row except for column 1.

df |>
  list(x = _) |>
  with(transform(x, result = apply(x[-1], 1, toNameString)))

Note

Input

df <- data.frame(
  id = 1:4,
  var_1 = c(TRUE, FALSE, FALSE, FALSE),
  var_2 = c(FALSE, TRUE, FALSE, FALSE),
  var_3 = c(FALSE, TRUE, FALSE, FALSE),
  var_4 = c(FALSE, FALSE, TRUE, FALSE),
  var_5 = c(FALSE, TRUE, TRUE, FALSE)
)

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 40171

A purrr solution inspired by @jpsmith:

df %>%
 mutate(result = pmap_chr(across(var_1:var_5), 
                          ~ if(any(c(...))) {toString(names(c(...))[c(...)])}
                          else {NA_character_}))

  id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE                <NA>

Upvotes: 1

ThomasIsCoding
ThomasIsCoding

Reputation: 102529

  • Option 1
nms <- names(df)[-1]
df$result <- by(unlist(df[-1]), row(df[-1]), \(v) ifelse(all(!v), NA, toString(nms[v])))

which gives

  id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE                <NA>
  • Option 2
df %>%
  left_join(
    df %>%
      pivot_longer(
        cols = !id,
        names_to = "var",
        values_to = "value"
      ) %>%
      reframe(result = toString(var[value]), .by = id),
    by = "id"
  )

which goes to

  id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE

Upvotes: 2

jpsmith
jpsmith

Reputation: 17656

In base R you can try:

df$new_col <- apply(df[ , -1], 1, \(x) {
  if (any(x)) {paste(names(df)[-1][x], collapse = ", ")} 
    else {NA}
})

Output:

#   id var_1 var_2 var_3 var_4 var_5             new_col
# 1  1  TRUE FALSE FALSE FALSE FALSE               var_1
# 2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
# 3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
# 4  4 FALSE FALSE FALSE FALSE FALSE                <NA>

Here I did df[,-1] to remove the id column, but your data have multiple columns that are a mix between what you want to use and what you want to exclude, you could tweak to below. Here, grep identifies all columns with var in the name (wantcols), but you could define it however you want:

# define columns of interest
wantcols <- grep("var", names(df), value = TRUE)

df$new_col <- apply(df[wantcols], 1, \(x) {
  if (any(x)) {paste(names(df[wantcols])[x], collapse = ", ")} 
    else {NA}
})

Upvotes: 1

langtang
langtang

Reputation: 24845

You can do the following:

left_join(
  df, 
  pivot_longer(df,cols=starts_with("var_")) |> 
    filter(value==T) |> 
    reframe(result = toString(name),.by=id)
)

Output:

  id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE                <NA>

You can also do this without pivoting longer.. For example:

df$result <- apply(df[,-1], 1, \(x) toString(names(x)[x]))

Output:

 id var_1 var_2 var_3 var_4 var_5              result
1  1  TRUE FALSE FALSE FALSE FALSE               var_1
2  2 FALSE  TRUE  TRUE FALSE  TRUE var_2, var_3, var_5
3  3 FALSE FALSE FALSE  TRUE  TRUE        var_4, var_5
4  4 FALSE FALSE FALSE FALSE FALSE                    

Upvotes: 3

Related Questions