Tyler Rinker

Reputation: 109844

Grab from beginning to first occurrence of character with gsub

I have the following string, and I'd like to grab everything from the beginning of the string up to the first ##. I could use strsplit, as demonstrated below, but I'd prefer a gsub solution. If gsub is not the right tool (I think it is, though), I'd still prefer a base R solution because I want to learn the base regex tools.

x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

strsplit(x, "##")[[c(1, 1)]]  # works

gsub("(.*)(##.*)", "\\1", x)  # I want this to work

Upvotes: 9

Views: 10736

Answers (6)

hadley

Reputation: 103898

Here's another approach that uses more string tools instead of a more complicated regular expression. It first finds the location of the first ## and then extracts the substring up to that point:

library(stringr)
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
loc <- str_locate(x, "##")
str_sub(x, 1, loc[, "start"] - 1)

Generally, I think this sort of step-by-step approach is more maintainable than complex regular expressions.
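Since the question asks for base tools, the same locate-then-extract idea can be sketched without stringr. This is only an illustration, not part of the original answer: the names pos and prefix are made up here, and the fallback of returning the whole string when there is no ## is an assumption about what you'd want.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Position of the first literal "##" (-1 if there is none).
# fixed = TRUE treats the pattern as plain text, not as a regex.
pos <- as.integer(regexpr("##", x, fixed = TRUE))

# Keep everything before that position; assumed fallback: keep x as-is.
prefix <- if (pos > 0) substr(x, 1, pos - 1) else x
prefix
# [1] "gfd gdr tsvfvetrv erv tevgergre "
```

regexpr only reports the first match, which is why no extra work is needed to ignore the second ##.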

Upvotes: 3

Matthew Plourde

Reputation: 44614

There are several simpler answers already here, but since you indicated in your question that you'd like to learn about regex support in base R, here's another way, using a positive lookahead assertion (?=#) and the non-greedy option (?U).

regmatches(x, regexpr('(?U)^.+(?=#)', x, perl=TRUE))
[1] "gfd gdr tsvfvetrv erv tevgergre "

Upvotes: 1

Josh O'Brien

Reputation: 162321

Just add one character, putting a ? after the first quantifier to make it "non-greedy":

gsub("(.*?)(##.*)", "\\1", x) 
# [1] "gfd gdr tsvfvetrv erv tevgergre "

Here's the relevant documentation, from ?regex:

By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to 'minimal' by appending '?' to the quantifier.
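To see the difference side by side, here is a small sketch that runs the greedy pattern from the question and the non-greedy version on the same string. The variable names greedy and lazy are just for illustration.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Greedy: the first group grabs as much as it can, so it runs up to the LAST "##".
greedy <- gsub("(.*)(##.*)", "\\1", x)

# Non-greedy: the first group grabs as little as it can, so it stops at the FIRST "##".
lazy <- gsub("(.*?)(##.*)", "\\1", x)

greedy
# [1] "gfd gdr tsvfvetrv erv tevgergre ## vev fe "
lazy
# [1] "gfd gdr tsvfvetrv erv tevgergre "
```

In both cases the whole string is matched and replaced by the first group; only the split point between the two groups changes.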

Upvotes: 21

Andrie

Reputation: 179408

In this case, I'd do the inverse, i.e. replace everything from the first # onward with an empty string:

gsub("#.*$", "", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "

But you can also use the non-greedy modifier ? to make your regex work in the way you suggested:

gsub("(.*?)#.*$", "\\1", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "

Upvotes: 4

Sacha Epskamp

Reputation: 47541

I'd say:

sub("##.*", "", x)

This removes everything from the first occurrence of ## onward, including the ## itself.

Upvotes: 4

garyh

Reputation: 2852

Try this as your regex:

^[^#]+

It starts at the beginning of the string and matches every character that is not a #, stopping just before the first #.
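This pattern describes the text to keep rather than the text to remove, so in base R it pairs more naturally with regexpr and regmatches than with gsub. A rough sketch follows; the names first_part and y are only for illustration. One caveat worth noting: the character class stops at any single #, not specifically at ##.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Extract the leading run of non-# characters.
first_part <- regmatches(x, regexpr("^[^#]+", x))
first_part
# [1] "gfd gdr tsvfvetrv erv tevgergre "

# Caveat: a lone "#" also ends the match.
y <- "a # b ## c"
regmatches(y, regexpr("^[^#]+", y))
# [1] "a "
```

If single # characters can legitimately appear before the ##, one of the patterns that matches ## explicitly is the safer choice.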

Upvotes: 1
