ako

Reputation: 3689

R extract text until, and not including x

I have a bunch of strings of mixed length, all with a year embedded. I am trying to extract just the text part, that is, everything up to where the number starts, and I am having problems with lookahead assertions (assuming that is the proper way to do such extractions).

Here is what I have (returns no match):

grep("\\b.(?=\\d{4})", "foo_1234_bar", perl = TRUE, value = TRUE)

In the example I am looking to extract just foo, but there may be several text parts of mixed lengths, separated by _, before the year portion.

Upvotes: 1

Views: 593

Answers (3)

Tyler Rinker

Reputation: 109984

Another approach: I often find that strsplit is faster than regex searching, though not always (and this does still use a slight bit of regex):

x <- c("asdfas_1987asdf", "asd_das_12") # shamelessly stealing Dason's example
# split on runs of digits and keep the first piece of each string
sapply(strsplit(x, "[0-9]+"), "[[", 1)
#[1] "asdfas_"  "asd_das_"

Upvotes: 2

Scott Weaver

Reputation: 7361

Look-aheads may be overkill here. Use the underscore and the 4 digits as the structure, combined with a non-greedy quantifier to prevent the 'dot' from gobbling up everything:

/(.+?)_\d{4}/

The first capture group ($1) holds 'foo'.
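For reference, here is a rough sketch of how that pattern might be used from R. The anchors, the trailing `.*`, the second test string and the variable names (`by_group`, `by_lookahead`) are my own additions for illustration, not part of the pattern above. The second call shows a lookahead version of the same idea, since the question asked about lookaheads.

```r
x <- c("foo_1234_bar", "asd_das_1987x")

# Capture group: replace the whole string with the lazily matched text
# that sits before "_" followed by four digits
by_group <- sub("^(.+?)_\\d{4}.*$", "\\1", x, perl = TRUE)
by_group
#[1] "foo"     "asd_das"

# Lookahead variant: match from the start up to, but not including, "_dddd"
by_lookahead <- regmatches(x, regexpr("^.+?(?=_\\d{4})", x, perl = TRUE))
by_lookahead
#[1] "foo"     "asd_das"
```

Note that sub() returns the input unchanged when the pattern does not match, whereas regmatches() silently drops non-matching elements, so the two can differ in length on messy input.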

Upvotes: 5

Dason

Reputation: 61953

This will grab everything up to, but not including, the first digit (note that the trailing underscore is kept):

x <- c("asdfas_1987asdf", "asd_das_12")
regmatches(x, regexpr("^[^[:digit:]]*", x))
#[1] "asdfas_"  "asd_das_"
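If the trailing underscore is not wanted (the question asks for just foo), one possible follow-up is to trim it afterwards. This is only a minimal sketch, assuming the separator is always an underscore; `prefix` and `trimmed` are illustrative names.

```r
x <- c("asdfas_1987asdf", "asd_das_12")

# Everything before the first digit, as above
prefix <- regmatches(x, regexpr("^[^[:digit:]]*", x))

# Drop any underscores left at the end of the extracted text
trimmed <- sub("_+$", "", prefix)
trimmed
#[1] "asdfas"  "asd_das"
```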

Upvotes: 4
