shea

Reputation: 528

regular expression to match up to first instance of repeated character

My example data:

l1
[1] "xmms-1.2.11-x86_64-5"     "xmms-1.2.11-x86_64-6"    
[3] "xmodmap-1.0.10-x86_64-1"  "xmodmap-1.0.9-x86_64-1"  
[5] "xmodmap3-1.0.10-x86_64-1" "xmodmap3-1.0.9-x86_64-1"

I am using R and would like a regular expression that captures just the characters before the first dash, for example:

xmms
xmms
xmodmap
xmodmap
xmodmap3
xmodmap3

Since I am using R, the regex needs to be Perl compliant.

I thought I could do this using a lookbehind on the dash, but I just get a match for the whole string. This is the pattern I tried: grepl("(?<=[a-z0-9])-", l1, perl=T). I think if I had the first dash as a capture group, I could maybe use the lookbehind, but I don't know how to build the regex with both the lookbehind and the capture group.

I looked around at some other questions for possible answers, and it seems I may need a non-greedy quantifier. I tried grepl("(?<=[a-z0-9])-/.+?(?=-)/", l1, perl=T), but that didn't work either.

I'm open to other suggestions on how to capture the first set of characters before the dash. I'm currently in base R, but I'm fine with using any packages, like stringr.

Upvotes: 0

Views: 948

Answers (3)

akrun

Reputation: 886968

1) Base R: One option is sub from base R, which matches the first - and everything after it (-.*) and replaces that with an empty string ("")

sub("-.*", "", l1)
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

Or capture as a group

sub("(\\w+).*", "\\1", l1)
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

Or with regmatches/regexpr

regmatches(l1, regexpr('\\w+', l1))
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

Or using trimws (the whitespace argument requires R >= 3.6.0)

trimws(l1,  "right", whitespace = "-.*")
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

Or using read.table

read.table(text = l1, sep="-", header = FALSE, stringsAsFactors = FALSE)$V1
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

or with strsplit

sapply(strsplit(l1, "-"), `[`, 1)

2) stringr: Use word from stringr

library(stringr)
word(l1, 1, sep="-")

Or with str_remove

str_remove(l1, "-.*")
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

3) stringi: Use stri_extract_first from stringi

library(stringi)
stri_extract_first(l1, regex = "\\w+")
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

Note: grep/grepl only detect whether a pattern occurs in a string. To replace or extract a substring, use sub/regexpr/regmatches in base R.
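To make the note concrete, here is a minimal sketch contrasting detection with extraction on two of the example strings. The variable names `x`, `has_dash` and `prefix` are illustrative only and do not come from the question.

```r
# Two of the example strings from the question
x <- c("xmms-1.2.11-x86_64-5", "xmodmap3-1.0.9-x86_64-1")

# grepl() only reports whether the pattern occurs: one TRUE/FALSE per element
has_dash <- grepl("-", x, fixed = TRUE)

# regexpr() locates the first match in each element;
# regmatches() then pulls out the matched text itself
prefix <- regmatches(x, regexpr("\\w+", x))

print(has_dash)
print(prefix)
```

This is why the grepl attempts in the question could never return the prefix: whatever the pattern, the result is a logical vector.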

data

l1 <- c("xmms-1.2.11-x86_64-5", "xmms-1.2.11-x86_64-6", "xmodmap-1.0.10-x86_64-1", 
"xmodmap-1.0.9-x86_64-1", "xmodmap3-1.0.10-x86_64-1", "xmodmap3-1.0.9-x86_64-1"
)

Upvotes: 0

SamWhan

Reputation: 8332

I guess the simplest regex to match what you're after would be

^[^-]+

Match start of string (^) and at least one character (the +) that isn't a - ([^-]).

See it here at regex101.

If you need to capture it, add surrounding parentheses.

^([^-]+)
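As a sketch of how these two patterns might be used from R (base functions only; the shortened `l1` and the names `first_part` and `first_part_sub` are assumptions for illustration):

```r
# A shortened version of the example data from the question
l1 <- c("xmms-1.2.11-x86_64-5", "xmodmap-1.0.10-x86_64-1",
        "xmodmap3-1.0.9-x86_64-1")

# Extract the match directly: everything from the start of the string
# up to, but not including, the first dash
first_part <- regmatches(l1, regexpr("^[^-]+", l1))

# Capture-group form: keep group 1 and discard the rest of the string
first_part_sub <- sub("^([^-]+).*", "\\1", l1)

print(first_part)
print(first_part_sub)
```

Both calls should give the same vector; the regmatches form needs no capture group because the whole match is already the part you want.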

Upvotes: 0

Ronak Shah

Reputation: 388817

You could also extract everything up to the first occurrence of "-". Using base R sub:

sub("(.*?)-.*", "\\1", l)
#[1] "xmms"     "xmms"     "xmodmap"  "xmodmap"  "xmodmap3" "xmodmap3"

Or with stringr::str_extract

stringr::str_extract(l, "(.*?)(?=-)")
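One difference worth knowing is how replacement and extraction behave when a string contains no dash at all. The sketch below uses base R only, with a made-up input "nodash" that is not part of the question's data; the base regmatches/regexpr pair stands in for the extraction approach.

```r
# One real example string plus a hypothetical one without any dash
x <- c("xmms-1.2.11-x86_64-5", "nodash")

# sub() returns an element unchanged when the pattern does not match
via_sub <- sub("(.*?)-.*", "\\1", x)

# regmatches() with regexpr() drops elements that have no match,
# so the result can be shorter than the input
via_regmatches <- regmatches(x, regexpr("^.*?(?=-)", x, perl = TRUE))

print(via_sub)
print(via_regmatches)
```

If every input is guaranteed to contain a dash, as in the question, the two approaches agree.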

data

l <- c("xmms-1.2.11-x86_64-5","xmms-1.2.11-x86_64-6","xmodmap-1.0.10-x86_64-1",
  "xmodmap-1.0.9-x86_64-1","xmodmap3-1.0.10-x86_64-1" ,"xmodmap3-1.0.9-x86_64-1")

Upvotes: 3
