Namra

Reputation: 359

Renaming multiple columns using regexp

Problem:

I want to rename a large number of column names by replacing certain repeated strings.

Reprex:

library(dplyr)
library(stringr)

code <- c(round(runif(26, 0, 100),0))
names <- letters
AIYN <- stringi::stri_rand_strings(26, 2)
SIYN <- stringi::stri_rand_strings(26, 2)


df <- bind_cols(code, names, AIYN, SIYN)
colnames(df) <- c("code (2021)", "names (2021)", "all the info you need (AIYN) from A to Z", 
                  "some info you need (SIYN) from A to Z")
View(df)

Attempted Solution

colnames(df) <- str_replace_all(colnames(df), "[(2021)]", "")
colnames(df) <- str_replace_all(colnames(df), "all the info you need (AIYN) from A to Z", "AIYN")
colnames(df) <- str_replace_all(colnames(df), "some info you need (SIYN) from A to Z", "SIYN")

Goal

I want to remove brackets that contain numbers (e.g. "(2019)"), and for brackets that contain only letters (e.g. "(AIYN)", "(SIYN)"), keep just those letters as the column name. My attempt is long-winded because my data frame has over a hundred columns.

Upvotes: 1

Views: 221

Answers (2)

Wiktor Stribiżew

Reputation: 626738

To remove brackets with numbers you need

stringr::str_replace_all(colnames(df), "\\s*\\(\\d+\\)", "")
stringr::str_remove_all(colnames(df), "\\s*\\(\\d+\\)")
gsub("\\s*\\(\\d+\\)", "", colnames(df))

If the numbers inside parentheses must consist of 4 digits, replace \d+ with \d{4}.

Put the above code inside trimws(...) to strip leading/trailing whitespace.

See the regex demo.
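As a minimal sketch of the two notes above (the four-digit variant and the trimws() wrapper), using base R only: the sample vector x is made up for illustration and is not the asker's data.

```r
# Hypothetical sample names; "id (7)" and "(2021) total" are invented
# to show the edge cases.
x <- c("code (2021)", "names (2021)", "id (7)", "(2021) total")

# \\d{4} matches exactly four digits, so "(7)" is left alone.
four_digit <- gsub("\\s*\\(\\d{4}\\)", "", x)
four_digit
# => "code" "names" "id (7)" " total"

# trimws() removes the leading space left behind when the
# parenthesized number came first in the name.
trimmed <- trimws(four_digit)
trimmed
# => "code" "names" "id (7)" "total"
```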

To keep the first letter-only value inside parentheses you need

stringr::str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))') # ASCII only
stringr::str_extract(colnames(df), '(?<=\\()\\p{L}+(?=\\))')   # Any Unicode

Combining both:

colnames(df) <- coalesce(str_extract(colnames(df), '(?<=\\()[A-Za-z]+(?=\\))'), str_replace_all(colnames(df), "\\s*\\(\\d+\\)", ""))

R test

library(dplyr)
library(stringr)

x <-  c("code (2021)", "names (2021)", "all the info you need (AIYN) from A to Z", 
        "some info you need (SIYN) from A to Z")

z <- str_replace_all(x, "\\s*\\(\\d+\\)", "")
# => [1] "code" "names" "all the info you need (AIYN) from A to Z"
#    [4] "some info you need (SIYN) from A to Z"
y <- str_extract(z, '(?<=\\()[A-Za-z]+(?=\\))')
# => [1] NA     NA     "AIYN" "SIYN"
coalesce(y, z)
# => "code"  "names" "AIYN"  "SIYN" 
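Since the question mentions over a hundred columns, the same two steps could be wrapped in one function and applied once. The following is only a sketch in base R: clean_names() is a hypothetical helper name, and the small data frame is invented for illustration. It stands in for the coalesce()/str_extract() combination with regexpr()/regmatches().

```r
# Hypothetical helper: strip "(digits)" groups, then replace any name that
# contains a letters-only "(...)" group with just those letters.
clean_names <- function(nms) {
  stripped <- trimws(gsub("\\s*\\(\\d+\\)", "", nms))
  # perl = TRUE is needed for the lookbehind/lookahead syntax
  m <- regexpr("(?<=\\()[A-Za-z]+(?=\\))", stripped, perl = TRUE)
  # regexpr() returns -1 where nothing matched; those names stay as they are
  stripped[m > 0] <- regmatches(stripped, m)
  stripped
}

# Invented example data frame
df <- data.frame(a = 1:2, b = 3:4, c = 5:6)
names(df) <- c("code (2021)", "some info you need (SIYN) from A to Z", "plain")

names(df) <- clean_names(names(df))
names(df)
# => "code" "SIYN" "plain"
```

The function takes and returns a character vector, so it should also be usable as the function argument of dplyr::rename_with().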

Upvotes: 1

Ronak Shah

Reputation: 388862

You can try -

library(magrittr)

names(df) <- sub('\\s\\(\\d+\\)', '', names(df)) %>%
                sub('.*\\(([A-Z]+)\\).*', '\\1', .)
names(df)
#[1] "code"  "names" "AIYN"  "SIYN" 

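The first sub drops a number inside parentheses along with the whitespace character before it.

The second sub replaces the whole name with the run of one or more [A-Z] characters found inside parentheses; names without such a group are left unchanged.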
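To make the two steps easier to follow, here is a small sketch that runs them separately on the column names from the question and keeps the intermediate result. The variable names step1 and step2 are only for illustration.

```r
x <- c("code (2021)", "names (2021)",
       "all the info you need (AIYN) from A to Z",
       "some info you need (SIYN) from A to Z")

# Step 1: remove one whitespace character plus a "(digits)" group
step1 <- sub("\\s\\(\\d+\\)", "", x)
step1
# => "code" "names"
#    "all the info you need (AIYN) from A to Z"
#    "some info you need (SIYN) from A to Z"

# Step 2: where a "(LETTERS)" group exists, the pattern matches the whole
# string and \\1 keeps only the captured letters; other names do not match
# and are returned unchanged.
step2 <- sub(".*\\(([A-Z]+)\\).*", "\\1", step1)
step2
# => "code" "names" "AIYN" "SIYN"
```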


To use this with dplyr and pipes -

library(dplyr)
df %>% 
    rename_with(~sub('\\s\\(\\d+\\)', '', .) %>% 
                 sub('.*\\(([A-Z]+)\\).*', '\\1', .))

#    code names AIYN  SIYN 
#   <dbl> <chr> <chr> <chr>
# 1     1 a     1A    NR   
# 2    96 b     Dq    hi   
# 3    46 c     28    AQ   
# 4    78 d     Y8    xH   
# 5    76 e     ps    ES   
# 6    56 f     m5    gQ   
# 7    51 g     vV    8u   
# 8    72 h     Hw    JV   
# 9    24 i     0T    7A   
#10    76 j     mq    Qy   
# … with 16 more rows

Upvotes: 1
