Reputation: 103

Substitute everything except a specific regular expression from a list in R

I want to substitute everything from a list that does NOT match a given pattern. I am using R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"

The example list I have is:

y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2", "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

and the pattern I want to match is CCNA_01234, where the digits differ in each case but there are always exactly 5 of them.

The desired output is:

"CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

So far I have removed the part before the match with:

y_begin_rm <- sub("D ", "", y)

but I have trouble matching the pattern with a [^match] expression:

y_CCNA_numbers <- sub("[^CCNA_[0-9][0-9][0-9][0-9][0-9]]*$", "", y_begin_rm)

that produces the output:

[1] "CCNA_01234 This is example 1 bis" "CCNA_02345 This is example 2"
[3] "CCNA_12345 This is example 3" "CCNA_23468 This is example 4"

It seems the digits in the bracket expression are matched anywhere in the string rather than as the exact sequence I want, so the number after the phrase "This is example " causes trouble. When I omit the digits, or include only a digit that occurs right after CCNA_ and nowhere later in the string, it works as expected:

y_CCNA <- sub("[^CCNA_]*$", "", y_begin_rm)

results in

[1] "CCNA_" "CCNA_" "CCNA_" "CCNA_"

y_CCNA_0 <- sub("[^CCNA_0]*$", "", y_begin_rm[1])

results in

[1] "CCNA_0"
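
The results above are consistent with how bracket expressions work: [^...] is a negated set of single characters, not a negated sequence, so the order and repetition of the characters inside it do not matter. A minimal sketch, using only base R and the first example string (the variable names a, b and c1 are illustrative):

```r
# A bracket expression is a set of single characters, so "[^CCNA_]" means
# "any one character that is not C, N, A or _" (the repeated C is redundant).
x <- "CCNA_01234 This is example 1 bis"

# Both patterns describe the same set, so they behave identically.
a <- sub("[^CCNA_]*$", "", x)
b <- sub("[^_ANC]*$", "", x)

# Adding 1 to the excluded set stops the trailing run at the last "1" in the
# string, which is the one in "example 1 bis", not the one in the identifier.
c1 <- sub("[^CCNA_1]*$", "", x)

print(a)   # "CCNA_"
print(c1)  # "CCNA_01234 This is example 1"
```

This is why a digit that also appears later in the string breaks the approach: the set cannot express "CCNA_ followed by five digits" as a unit.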

Is there a way to specify the exact pattern I am looking for (CCNA_[0-9][0-9][0-9][0-9][0-9])? Also, is there a possible way to do it in a single step (remove before and after the match in a single regular expression)?

Thanks in advance!

Upvotes: 3

Answers (3)

G. Grothendieck

Reputation: 270010

Here are a few ways:

1) strapplyc. This uses a particularly simple pattern. It makes use of strapplyc in the gsubfn package:

library(gsubfn)
strapplyc(y, "CCNA_\\d{5}", simplify = TRUE)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

The regular expression is CCNA_\d{5}: the literal text CCNA_ followed by exactly five digits.

[Figure: regular expression visualization]

1a) If the only occurrences of CCNA_ are before 5 digits then we can simplify the previous solution slightly like this:

strapplyc(y, "CCNA_.{5}", simplify = TRUE)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

2) sub. The pattern here is slightly more complicated but using sub we can do it without any addon packages:

sub(".*(CCNA_\\d{5}).*", "\\1", y)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

3) strsplit. If the portion wanted is always the second "word" (which is the case in the question) then this would work and again requires no packages:

sapply(strsplit(y, " "), "[", 2)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

4) substr. If the desired portion is always characters 3 through 12, as it is in the question, then we could use substr or substring, again without any packages:

substr(y, 3, 12)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
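
The approaches above differ in what they do with input that does not look like the question's data. The following is a rough sketch in base R, on a hypothetical vector z that is not part of the question, showing two of those differences: the looser pattern from (1a) accepts non-digits, and sub from (2) returns its input unchanged when nothing matches. Mapping non-matches to NA is one possible convention, not something the question asks for.

```r
# Hypothetical input: the second element has letters after CCNA_, the third
# has no identifier at all.
z <- c("D CCNA_01234 ok", "D CCNA_abcde letters", "no identifier here")

# "." matches any character, so the looser pattern also accepts "CCNA_abcde".
loose  <- regmatches(z, regexpr("CCNA_.{5}", z))
strict <- regmatches(z, regexpr("CCNA_\\d{5}", z))

# sub() returns the input unchanged when the pattern does not match, so
# non-matching elements are mapped to NA explicitly here.
pat <- "CCNA_\\d{5}"
guarded <- ifelse(grepl(pat, z),
                  sub(paste0(".*(", pat, ").*"), "\\1", z),
                  NA_character_)

print(loose)    # "CCNA_01234" "CCNA_abcde"
print(strict)   # "CCNA_01234"
print(guarded)  # "CCNA_01234" NA NA
```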

Upvotes: 5

Tyler Rinker

Reputation: 109994

Here's an approach using a package I maintain, qdapRegex (I prefer this or stringi/stringr to base for consistency and ease of use). I also show a base approach. In any event I'd look at this more as an "extraction" problem than a "sub everything but" problem.

y <- c("D CCNA_01234 This is example 1 bis", "D CCNA_02345 This is example 2", 
    "D CCNA_12345 This is example 3", "D CCNA_23468 This is example 4")

library(qdapRegex)
unlist(rm_default(y, pattern = "CCNA_\\d{5}", extract = TRUE))

## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

In base R:

unlist(regmatches(y, gregexpr("CCNA_\\d{5}", y)))

## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
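
One thing to keep in mind with the base approach: unlist() flattens the result, so the output length can differ from the input length when an element has no match or more than one. A small sketch on a hypothetical vector z (not from the question) that keeps one value per input element, taking the first match and using NA where there is none:

```r
# Hypothetical input: two identifiers, none, and one.
z <- c("D CCNA_01234 and CCNA_99999", "nothing here", "D CCNA_12345 x")

# gregexpr() finds all matches; regmatches() returns one character vector
# per input element (empty when there is no match).
m <- regmatches(z, gregexpr("CCNA_\\d{5}", z))
n_hits <- vapply(m, length, integer(1))

# Flattening loses the alignment with z.
flat <- unlist(m)

# Keep alignment: first match per element, NA when there is none.
first <- vapply(m, function(hits) if (length(hits)) hits[1] else NA_character_,
                character(1))

print(n_hits)  # 2 0 1
print(first)   # "CCNA_01234" NA "CCNA_12345"
```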

Upvotes: 4

David Arenburg

Reputation: 92302

With base R you could do this directly from your original vector y:

sub(".*(CCNA_\\d+).*", "\\1", y)
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

Another option is to use stringi:

library(stringi)
stri_extract_first_regex(y, "CCNA_\\d+")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

If you have more than one CCNA pattern in each string, use stri_extract_all_regex instead.

In case you want to match exactly 5 digits after CCNA_, you could also do:

stri_extract_first_regex(y, "CCNA_\\d{5}")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"

And of course similarly with stringr

library(stringr)
str_extract(y, "CCNA_\\d{5}")
## [1] "CCNA_01234" "CCNA_02345" "CCNA_12345" "CCNA_23468"
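
The \\d+ and \\d{5} patterns give the same result on the question's data but behave differently when the number of digits varies. A short sketch in base R on a hypothetical vector z (not from the question); the lookahead variant with perl = TRUE is an extra illustration for rejecting longer digit runs, not something used in the answers above:

```r
# Hypothetical input with 5, 7 and 3 digits after CCNA_.
z <- c("D CCNA_01234 x", "D CCNA_0123456 x", "D CCNA_012 x")

# \\d+ takes every digit that follows CCNA_.
m_plus <- regmatches(z, regexpr("CCNA_\\d+", z))

# \\d{5} takes exactly five, so it truncates the 7-digit run and skips the
# 3-digit one.
m_five <- regmatches(z, regexpr("CCNA_\\d{5}", z))

# A negative lookahead (PCRE, hence perl = TRUE) additionally requires that
# no sixth digit follows.
m_exact <- regmatches(z, regexpr("CCNA_\\d{5}(?!\\d)", z, perl = TRUE))

print(m_plus)   # "CCNA_01234" "CCNA_0123456" "CCNA_012"
print(m_five)   # "CCNA_01234" "CCNA_01234"
print(m_exact)  # "CCNA_01234"
```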

Upvotes: 5
