FightMilk

Reputation: 174

Selecting multiple columns using Regular Expressions

I have variables with names such as r1a r3c r5e r7g r9i r11k r13g r15i etc. I am trying to select the variables whose names start with r5 through r12 and put them in a new data frame in R.

The best code I could write to get this done is:

data %>% select(grep("r[5-9][^0-9]" , names(data), value = TRUE ),
grep("r1[0-2]", names(data), value = TRUE))

Given that my experience with regular expressions spans a single day, I was wondering if anyone could help me write better and more compact code for this!

Upvotes: 2

Views: 1484

Answers (3)

Rui Barradas

Reputation: 76412

Suppose that x in the code below stands for your names(data). The following extracts the number from each name and keeps the names whose number is greater than 4.

# The names of 'data'
x <- scan(what = character(), text = "r1a r3c r5e r7g r9i r11k r13g r15i")

y <- unlist(strsplit(x, "[[:alpha:]]"))
y <- as.numeric(y[sapply(y, `!=`, "")])
x[y > 4]
#[1] "r5e"  "r7g"  "r9i"  "r11k" "r13g" "r15i"

EDIT.

You can generalize the code above into a function. It takes three arguments: the first is the vector of variable names, and the second and third are the lower and upper limits of the numbers you want to keep.

var_names <- function(x, from = 1, to = Inf){
    y <- unlist(strsplit(x, "[[:alpha:]]"))
    y <- as.integer(y[sapply(y, `!=`, "")])
    x[from <= y & y <= to]
}

var_names(x, 5)
#[1] "r5e"  "r7g"  "r9i"  "r11k" "r13g" "r15i"
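
To restrict the selection to the r5 to r12 range asked about in the question, both limits can be passed. A minimal sketch, which repeats the function definition so that it runs on its own (the object name kept is only for illustration):

```r
# Sample names from the question
x <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i")

# Same function as above: keep the names whose number lies in [from, to]
var_names <- function(x, from = 1, to = Inf){
    y <- unlist(strsplit(x, "[[:alpha:]]"))
    y <- as.integer(y[sapply(y, `!=`, "")])
    x[from <= y & y <= to]
}

kept <- var_names(x, from = 5, to = 12)
kept
#[1] "r5e"  "r7g"  "r9i"  "r11k"
```

The result can then be used to subset the data frame, for example data[kept], assuming x was taken from names(data).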

Upvotes: 2

G. Grothendieck

Reputation: 269644

Remove the non-digits, scan the remainder in, and check whether each number is in 5:12:

DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8) # test data

DF[scan(text = gsub("\\D", "", names(DF)), quiet = TRUE) %in% 5:12]
##   r5e r7g r9i r11k
## 1   3   4   5    6

Using magrittr it could also be written like this:

library(magrittr)

DF %>% .[scan(text = gsub("\\D", "", names(.)), quiet = TRUE) %in% 5:12]
##   r5e r7g r9i r11k
## 1   3   4   5    6
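
To make the intermediate step visible, the same idea can be split in two. This is only a sketch: as.integer() is swapped in for scan() here, and num and res are illustrative names. Like the one-liner, it assumes each column name contains a single run of digits.

```r
# Test data, as above
DF <- data.frame(r1a=1, r3c=2, r5e=3, r7g=4, r9i=5, r11k=6, r13g=7, r15i=8)

# Step 1: strip everything that is not a digit and convert to integers
num <- as.integer(gsub("\\D", "", names(DF)))

# Step 2: keep the columns whose number is between 5 and 12
res <- DF[num %in% 5:12]
names(res)
#[1] "r5e"  "r7g"  "r9i"  "r11k"
```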

Upvotes: 1

C. Braun

Reputation: 5201

Here's a regex that gets all the columns at once:

data %>% select(grep("r([5-9]|1[0-2])", names(data), value = TRUE))

The vertical bar represents an 'or'.

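As the comments have pointed out, this pattern also matches names such as r51, and the call itself can be shortened with matches(). To exclude names like r51, you need a slightly longer regex that requires the number to be followed by a non-digit or the end of the name: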

data %>% select(matches("r([5-9]|1[0-2])([^0-9]|$)"))
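
The difference between the two patterns can be checked in base R with grepl(). A small sketch, using the names from the question plus two hypothetical edge cases (r51 and r12) that are not in the original data:

```r
# Names from the question, plus two made-up edge cases: "r51" and "r12"
nms <- c("r1a", "r3c", "r5e", "r7g", "r9i", "r11k", "r13g", "r15i", "r51", "r12")

# First pattern: the "r5" inside "r51" is enough for a match
first <- grepl("r([5-9]|1[0-2])", nms)

# Longer pattern: the number must be followed by a non-digit or the end of the name
second <- grepl("r([5-9]|1[0-2])([^0-9]|$)", nms)

nms[first]
#[1] "r5e"  "r7g"  "r9i"  "r11k" "r51"  "r12"
nms[second]
#[1] "r5e"  "r7g"  "r9i"  "r11k" "r12"
```

Neither pattern is anchored at the start of the name; if other columns could contain an r followed by digits in the middle of their names, prefixing the pattern with ^ would be a reasonable precaution.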

Upvotes: 2
