Doe

Reputation: 27

R - select a regular expression

I want to select every line in which the expression "X01" or "X02" can be found:

dataEx <- data.frame(code = c("X01-X043","X034","X024","X015-X036-X033","X012","X015-X042","X019","X036","X022-X043"),res = NA )
pat1 <- c("(^|-)X01($|-|.)","(^|-)X02($|-|.)")
dataEx$res[grep(paste(pat1,collapse="|"),dataEx$code)] <- "ok"

It works correctly and gives me the result :

            code  res
1       X01-X043   ok
2           X034 <NA>
3           X024   ok
4 X015-X036-X033   ok
5           X012   ok
6      X015-X042   ok
7           X019   ok
8           X036 <NA>
9      X022-X043   ok

But I would like to know which pattern is found :

            code  res
1       X01-X043  X01
2           X034 <NA>
3           X024 X024
4 X015-X036-X033 X015
5           X012 X012
6      X015-X042 X015
7           X019 X019
8           X036 <NA>
9      X022-X043 X022

I am very new to regular expressions. Is there an easy way to do this? (In reality, "pat1" is much longer: I am looking for 20 different patterns.)

Upvotes: 1

Views: 66

Answers (3)

Chris Ruehlemann

Reputation: 21400

You can use str_extract in this way:

library(stringr)
dataEx$res <- str_extract(dataEx$code, "X0(1|2)\\d?")

Here, we match the literal X0 followed by either 1 or 2, followed by one optional digit.

Result:

dataEx
            code  res
1       X01-X043  X01
2           X034 <NA>
3           X024 X024
4 X015-X036-X033 X015
5           X012 X012
6      X015-X042 X015
7           X019 X019
8           X036 <NA>
9      X022-X043 X022
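
Since the question mentions roughly 20 patterns, the same idea can be sketched with the regex built from a vector instead of typed by hand. This is only a sketch in base R: the vector name prefixes is hypothetical, and it assumes every code is either at the start of the string or preceded by a hyphen, so a word boundary (\\b) is enough to anchor it.

```r
# Sample data from the question
dataEx <- data.frame(
  code = c("X01-X043", "X034", "X024", "X015-X036-X033", "X012",
           "X015-X042", "X019", "X036", "X022-X043"),
  stringsAsFactors = FALSE
)

# Hypothetical vector of prefixes; extend it to the full list of ~20
prefixes <- c("X01", "X02")

# \\b anchors the prefix at the string start or right after a hyphen;
# \\d* takes any digits that follow, up to the next hyphen or the end
pattern <- paste0("\\b(", paste(prefixes, collapse = "|"), ")\\d*")

# regexpr() finds the first match per element (-1 means no match)
m <- regexpr(pattern, dataEx$code, perl = TRUE)

# regmatches() returns only the matched pieces, so assign them
# to the rows that actually matched and leave the rest as NA
dataEx$res <- NA_character_
dataEx$res[m != -1] <- regmatches(dataEx$code, m)
```

With the sample data this should leave rows 2 and 8 as NA and return only the first matching code in each of the other rows.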

Upvotes: 1

Onyambu

Reputation: 79208

You could do:

a <- regmatches(dataEx$code, gregexpr(paste(pat1, collapse = "|"), dataEx$code))
is.na(a)<-lengths(a)==0

dataEx$res <- unlist(a)

The question though is what if there is more than one match on one row?
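
One possible way to handle that case, sketched here under assumptions: keep every match and collapse the matches of each row into a single string. The codes vector and the ";" separator are made up for illustration, and the simplified pattern assumes codes are separated by hyphens.

```r
# Hypothetical input; the third element contains two matching codes
codes <- c("X01-X043", "X034", "X015-X022")

# Simplified pattern: a prefix at a word boundary plus any trailing digits
pattern <- "\\b(X01|X02)\\d*"

# gregexpr() finds all matches per element; regmatches() returns a list
hits <- regmatches(codes, gregexpr(pattern, codes, perl = TRUE))

# Collapse each list element to one string, or NA when nothing matched
res <- vapply(
  hits,
  function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = ";"),
  character(1)
)
```

Because vapply returns exactly one value per input element, the result can be assigned to a data frame column even when some rows hold several matches.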

Upvotes: 0

John Girardot

Reputation: 371

Are you open to using the stringr package? I agree with Jaskeil, I tend to prefer data.table over data.frame but that is primarily for execution speed. Not sure if that will be a concern for your application.

library(stringr)
dataEx <- data.frame(code = c("X01-X043","X034","X024","X015-X036-X033","X012","X015-X042","X019","X036","X022-X043"),res = NA )
dataEx$res <- str_extract(dataEx$code, "((^|-)X01($|-|.))|((^|-)X02($|-|.))")

Upvotes: 0
