Reputation: 4201
I want to rename columns in my data based on whether all values in those columns satisfy a condition. For example, in a numeric column, if all values are greater than 5, rename the column to large_values
, otherwise keep the column name as is. Another example, if all values in a character column are free from the &
character, rename the column to no_ampersand
.
library(tibble)
df <-
tribble(~age_1, ~age_2, ~string_1,
2, 7, "abc",
3, 8, "efg",
1, 11, "hi&",
10, 6, "klmn",
50, 100, "opq")
is_larger_than_val <- function(x, y) {
all(x > y)
}
&
?does_contain_ampersand <- function(x) {
all(grepl("&", x))
}
So using those functions, how can I rename headers of df
such that when
is_larger_than_val(df$age_2, 5)
[1] TRUE
Will rename age_2
to large_values
, but otherwise in case when
is_larger_than_val(df$age_1, 5)
[1] FALSE
will keep age_1
as is?
And similarly, because
does_contain_ampersand(df$string_1)
[1] FALSE
will keep string_1
as is (but if it had been TRUE
then string_1
would have been renamed to no_ampersand
)?
Given the current data, renaming based on the conditions I specified in is_larger_than_val()
and does_contain_ampersand()
should return:
# A tibble: 5 x 3
age_1 large_values string_1
<dbl> <dbl> <chr>
1 2 7 abc
2 3 8 efg
3 1 11 hi&
4 10 6 klmn
5 50 100 opq
I'm sure this could be achieved with nesting some if else
statements, but am wondering if there is a simpler way (maybe using tidyverse
tricks?).
Thanks!
R
data object that originates from a JSON
file. Once I read the JSON
file into R
, it ends up in the following format:vec <- c(1, 2, 3)
names(vec) <- c("A", "B", "C")
my_data_object_as_list <- as.list(vec)
my_data_object_as_list
## $A
## [1] 1
## $B
## [1] 2
## $C
## [1] 3
my_data_object_as_list
and reorganizes it into a table.require(tidyr)
require(dplyr)
require(tidyselect)
organize_in_table <- function(as_list_object) {
as_list_object %>%
bind_rows() %>%
pivot_longer(cols = tidyselect::everything())
}
organize_in_table(my_data_object_as_list)
## # A tibble: 3 x 2
## name value
## <chr> <dbl>
## 1 A 1
## 2 B 2
## 3 C 3
organize_in_table()
is pretty generic, it returns a table with column names (name
and value
) that are not indicative to what each column is about. To address this problem, I want to add a purpose
argument to organize_in_table()
, and here is one example:organize_in_table <-
function(as_list_object,
purpose = NULL) {
table <- as_list_object %>%
bind_rows() %>%
pivot_longer(cols = tidyselect::everything())
if (is.null(purpose)) {
return(table)
} else if (purpose == "match_letters_and_numbers") {
table <- rename(table, letters = name, numbers = value)
}
return(table)
}
Now, organize_in_table()
can return an object that has meaningful names:
df_letters_and_numbers <-
organize_in_table(my_data_object_as_list, "match_letters_and_numbers")
> df_letters_and_numbers
## # A tibble: 3 x 2
## letters numbers
## <chr> <dbl>
## 1 A 1
## 2 B 2
## 3 C 3
df_letters_and_numbers[1]
has been correctly named as letters
whereas df_letters_and_numbers[2]
is numbers
? On what basis have I verified — when I renamed table
within organize_in_table()
— that the first column (name
) is all letters and the second column (value
) is all numbers? I didn't verify.What if my_data_object_as_list
had looked like this:
vec_2 <- c("A", "B", "C")
names(vec_2) <- c(1, 2, 3)
my_data_object_as_list_2 <- as.list(vec_2)
> my_data_object_as_list_2
## $`1`
## [1] "A"
## $`2`
## [1] "B"
## $`3`
## [1] "C"
Then if I would run organize_in_table(my_data_object_as_list_2, "match_letters_and_numbers")
I will get the dataframe with mismatching column names:
## # A tibble: 3 x 2
## letters numbers
## <chr> <chr>
## 1 1 A
## 2 2 B
## 3 3 C
So my conclusion is that I have to condition the renaming step within organize_in_table()
, based on the content of each column being renamed. To that end, I thought that the first step should be to define a separate function for each test I want to perform (e.g., are all values in the column numbers? are all values letters?). And because I want to make organize_in_table()
as scalable as I can, I would like it accept any test that I can build a testing function for.
Upvotes: 1
Views: 928
Reputation: 2727
The sapply
solution is great with it's simplicity. Here is an alternative solution with data.table
that's a bit more effort. I think its the same principle but I've added a step to identify numeric and string columns to ensure only certain columns are being tested
# data.table solution
library(data.table)
dt <- as.data.table(df)
# I think you need !(any()) rather than all
# This will identify if there is no_ampersand.
# Design tests so that if they are TRUE then change the column name for consistency
# Also & needs to be escaped with \\ to find it
no_ampersand <- function(x) {
!any(grepl("\\&", x))
}
# function for taking data.table (dt), the test result and the new column names
# and updating the column name
alterColumnIfMatch <- function(dt, test_data, new_column="large_values"){
# find ones to change
alter_data <- names(test_data)[which(test_data == TRUE)]
# if there are any, then use setnames to update to the new column value
if(length(alter_data) > 0) setnames(dt, alter_data, new_column)
return(dt)
}
# identify which columns to run through the tests
col_class <- sapply(dt, class)
numeric_cols <- names(col_class)[col_class == "numeric" | col_class == "integer"]
character_cols <- names(col_class)[col_class == "character" ]
# test for larger than 5 and update
test_larger <- dt[, lapply(.SD, function(col) is_larger_than_val(col, 5)),
.SDcols = numeric_cols]
dt <- alterColumnIfMatch(dt, test_larger, "large_values")
# test for no ampersand and update
test_ampersand <- dt[, lapply(.SD, function(col) does_contain_ampersand(col)),
.SDcols = character_cols]
dt <- alterColumnIfMatch(dt, test_ampersand, "no_ampersand")
# convert back to tibble for you
out <- as_tibble(dt)
out
Upvotes: 2
Reputation: 76432
Use sapply
on the data in order to create a logical index on the columns, but be careful with the character columns in the ampersand case:
i <- sapply(df, is_larger_than_val, y = 5)
names(df)[i] <- "large_values"
i <- sapply(df, does_contain_ampersand)
i <- i | !sapply(df, is.character)
names(df)[!i] <- "no_ampersand"
names(df)
#[1] "age_1" "large_values" "no_ampersand"
Upvotes: 2