Emman
Emman

Reputation: 4201

Rename a column only if all its values satisfy a condition

I want to rename columns in my data based on whether all values in those columns satisfy a condition. For example, in a numeric column, if all values are greater than 5, rename the column to large_values, otherwise keep the column name as is. Another example, if all values in a character column are free from the & character, rename the column to no_ampersand.

Data

library(tibble)

df <- 
  tribble(~age_1, ~age_2, ~string_1,
        2, 7, "abc",
        3, 8, "efg",
        1, 11, "hi&",
        10, 6, "klmn",
        50, 100, "opq")

Defining two functions for demonstration

  1. Are vector elements greater than a value
is_larger_than_val <- function(x, y) {
  all(x > y)
}
  1. Do vector elements contain &?
does_contain_ampersand <- function(x) {
  all(grepl("&", x))
}

The question

So using those functions, how can I rename headers of df such that when

is_larger_than_val(df$age_2, 5)

[1] TRUE

Will rename age_2 to large_values, but otherwise in case when

is_larger_than_val(df$age_1, 5)

[1] FALSE

will keep age_1 as is?

And similarly, because

does_contain_ampersand(df$string_1)

[1] FALSE

will keep string_1 as is (but if it had been TRUE then string_1 would have been renamed to no_ampersand)?

Desired Output

Given the current data, renaming based on the conditions I specified in is_larger_than_val() and does_contain_ampersand() should return:

# A tibble: 5 x 3
  age_1 large_values string_1
  <dbl>        <dbl> <chr>   
1     2            7 abc     
2     3            8 efg     
3     1           11 hi&     
4    10            6 klmn    
5    50          100 opq  

I'm sure this could be achieved with nesting some if else statements, but am wondering if there is a simpler way (maybe using tidyverse tricks?).

Thanks!


EDIT


Putting the question in context

Following @Ian's comment below, here's the context for my problem.
  1. I have an R data object that originates from a JSON file. Once I read the JSON file into R, it ends up in the following format:
vec <- c(1, 2, 3)
names(vec) <- c("A", "B", "C")
my_data_object_as_list <- as.list(vec)

my_data_object_as_list
## $A
## [1] 1

## $B
## [1] 2

## $C
## [1] 3
  1. I want to build a function that takes my_data_object_as_list and reorganizes it into a table.
require(tidyr)
require(dplyr)
require(tidyselect)

organize_in_table <- function(as_list_object) {
    as_list_object %>%
    bind_rows() %>%
    pivot_longer(cols = tidyselect::everything())
}
organize_in_table(my_data_object_as_list)

## # A tibble: 3 x 2
##   name  value
##   <chr> <dbl>
## 1 A         1
## 2 B         2
## 3 C         3
  1. Because organize_in_table() is pretty generic, it returns a table with column names (name and value) that are not indicative to what each column is about. To address this problem, I want to add a purpose argument to organize_in_table(), and here is one example:
organize_in_table <-
  function(as_list_object,
           purpose = NULL) {
    table <- as_list_object %>%
      bind_rows() %>%
      pivot_longer(cols = tidyselect::everything())
    
    if (is.null(purpose)) {
      return(table)
    } else if (purpose == "match_letters_and_numbers") {
      table <- rename(table, letters = name, numbers = value)
    }
    return(table)
  }

Now, organize_in_table() can return an object that has meaningful names:

df_letters_and_numbers <- 
  organize_in_table(my_data_object_as_list, "match_letters_and_numbers")

> df_letters_and_numbers
## # A tibble: 3 x 2
##   letters numbers
##   <chr>     <dbl>
## 1 A             1
## 2 B             2
## 3 C             3
  1. Here I encounter a problem: How do I know that df_letters_and_numbers[1] has been correctly named as letters whereas df_letters_and_numbers[2] is numbers? On what basis have I verified — when I renamed table within organize_in_table() — that the first column (name) is all letters and the second column (value) is all numbers? I didn't verify.

What if my_data_object_as_list had looked like this:

vec_2 <- c("A", "B", "C")
names(vec_2) <- c(1, 2, 3)
my_data_object_as_list_2 <- as.list(vec_2)

> my_data_object_as_list_2 
## $`1`
## [1] "A"

## $`2`
## [1] "B"

## $`3`
## [1] "C"

Then if I would run organize_in_table(my_data_object_as_list_2, "match_letters_and_numbers") I will get the dataframe with mismatching column names:

## # A tibble: 3 x 2
##   letters numbers
##   <chr>   <chr>  
## 1 1       A      
## 2 2       B      
## 3 3       C     

Bottom line

So my conclusion is that I have to condition the renaming step within organize_in_table(), based on the content of each column being renamed. To that end, I thought that the first step should be to define a separate function for each test I want to perform (e.g., are all values in the column numbers? are all values letters?). And because I want to make organize_in_table() as scalable as I can, I would like it accept any test that I can build a testing function for.

Upvotes: 1

Views: 928

Answers (2)

Jonny Phelps
Jonny Phelps

Reputation: 2727

The sapply solution is great with it's simplicity. Here is an alternative solution with data.table that's a bit more effort. I think its the same principle but I've added a step to identify numeric and string columns to ensure only certain columns are being tested

# data.table solution
library(data.table)
dt <- as.data.table(df)

# I think you need !(any()) rather than all
# This will identify if there is no_ampersand.
# Design tests so that if they are TRUE then change the column name for consistency
# Also & needs to be escaped with \\ to find it
no_ampersand <- function(x) {
  !any(grepl("\\&", x))
}

# function for taking data.table (dt), the test result and the new column names
# and updating the column name
alterColumnIfMatch <- function(dt, test_data, new_column="large_values"){
  # find ones to change
  alter_data <- names(test_data)[which(test_data == TRUE)]
  # if there are any, then use setnames to update to the new column value
  if(length(alter_data) > 0) setnames(dt, alter_data, new_column)
  return(dt)
}

# identify which columns to run through the tests
col_class <- sapply(dt, class)
numeric_cols <- names(col_class)[col_class == "numeric" | col_class == "integer"]
character_cols <- names(col_class)[col_class == "character" ]

# test for larger than 5 and update
test_larger <- dt[, lapply(.SD, function(col) is_larger_than_val(col, 5)),
                  .SDcols = numeric_cols]
dt <- alterColumnIfMatch(dt, test_larger, "large_values")

# test for no ampersand and update
test_ampersand <- dt[, lapply(.SD, function(col) does_contain_ampersand(col)),
                     .SDcols = character_cols]
dt <- alterColumnIfMatch(dt, test_ampersand, "no_ampersand")

# convert back to tibble for you
out <- as_tibble(dt)
out

Upvotes: 2

Rui Barradas
Rui Barradas

Reputation: 76432

Use sapply on the data in order to create a logical index on the columns, but be careful with the character columns in the ampersand case:

i <- sapply(df, is_larger_than_val, y = 5)
names(df)[i] <- "large_values"

i <- sapply(df, does_contain_ampersand)
i <- i | !sapply(df, is.character)
names(df)[!i] <- "no_ampersand"

names(df)
#[1] "age_1"        "large_values" "no_ampersand"

Upvotes: 2

Related Questions