Reputation: 109844
I have the following regex, which I'd like to use to grab everything from the beginning of the string up to the first ##. I could use strsplit, as I demonstrate below, but I'd prefer a gsub solution. If gsub is not the correct tool (I think it is, though), I'd still prefer a base solution because I want to learn the base regex tools.
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
strsplit(x, "##")[[c(1, 1)]] # works
gsub("(.*)(##.*)", "\\1", x) # I want this to work
Upvotes: 9
Views: 10736
Reputation: 103898
Here's another approach that uses more string tools instead of a more complicated regular expression. It first finds the location of the first ## and then extracts the substring up to that point:
library(stringr)
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"
loc <- str_locate(x, "##")
str_sub(x, 1, loc[, "start"] - 1)
Generally, I think this sort of step-by-step approach is more maintainable than complex regular expressions.
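Since the question asks for base tools, the same locate-then-extract idea can be sketched without stringr. This is only an illustration, assuming regexpr() with fixed = TRUE as the stand-in for str_locate() and substr() as the stand-in for str_sub(); the names pos and result are made up for the example.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# regexpr() returns the start position of the first match, or -1 if there is none.
# fixed = TRUE treats "##" as a literal string rather than a regular expression.
pos <- regexpr("##", x, fixed = TRUE)

# Keep everything before the match; leave the string unchanged if "##" is absent.
result <- if (pos > 0) substr(x, 1, pos - 1) else x
result
# [1] "gfd gdr tsvfvetrv erv tevgergre "
```

The explicit check on pos matters because substr(x, 1, -2) would silently return an empty string when there is no match.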
Upvotes: 3
Reputation: 44614
There are several simpler answers already here, but since you indicated in your question that you'd like to learn about regex support in base R, here's another way, using a positive lookahead assertion (?=#) and the non-greedy option (?U).
regmatches(x, regexpr('(?U)^.+(?=#)', x, perl=TRUE))
[1] "gfd gdr tsvfvetrv erv tevgergre "
Upvotes: 1
Reputation: 162321
Just add one character, putting a ? after the first quantifier to make it "non-greedy":
gsub("(.*?)(##.*)", "\\1", x)
# [1] "gfd gdr tsvfvetrv erv tevgergre "
Here's the relevant documentation, from ?regex:
By default repetition is greedy, so the maximal possible number of repeats is used. This can be changed to 'minimal' by appending '?' to the quantifier.
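To make the difference concrete, here is a small side-by-side sketch using the string from the question. The variable names greedy and lazy are only for illustration, and sub() is used instead of gsub() because a single replacement is enough here.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Greedy: the first group grabs as much as it can, so it runs up to the LAST "##".
greedy <- sub("(.*)(##.*)", "\\1", x)
greedy
# [1] "gfd gdr tsvfvetrv erv tevgergre ## vev fe "

# Non-greedy: the first group grabs as little as it can, so it stops at the FIRST "##".
lazy <- sub("(.*?)(##.*)", "\\1", x)
lazy
# [1] "gfd gdr tsvfvetrv erv tevgergre "
```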
Upvotes: 21
Reputation: 179408
In this case, I'd do the inverse, i.e. replace everything from the first # onward with an empty string:
gsub("#.*$", "", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
But you can also use the non-greedy modifier ? to make your regex work in the way you suggested:
gsub("(.*?)#.*$", "\\1", x)
[1] "gfd gdr tsvfvetrv erv tevgergre "
Upvotes: 4
Reputation: 47541
I'd say:
sub("##.*", "", x)
This removes the first occurrence of ## and everything after it.
Upvotes: 4
Reputation: 2852
Try this as your regex:
^[^#]+
It starts at the beginning of the string and matches any characters other than # up to the first #.
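One way to apply that pattern in base R is to extract the match rather than substitute. The following is a minimal sketch, assuming regmatches() with regexpr() as the extraction tools; the second string y is a made-up example to show where this pattern differs from one anchored on ##.

```r
x <- "gfd gdr tsvfvetrv erv tevgergre ## vev fe ## vgrrgf"

# Extract the leading run of characters that are not "#".
first_part <- regmatches(x, regexpr("^[^#]+", x))
first_part
# [1] "gfd gdr tsvfvetrv erv tevgergre "

# Caveat: the character class stops at a single "#", not only at "##".
y <- "a # b ## c"          # hypothetical input with a lone "#"
by_class <- regmatches(y, regexpr("^[^#]+", y))
by_class
# [1] "a "
by_delim <- sub("##.*", "", y)
by_delim
# [1] "a # b "
```

If the text before the delimiter can contain a lone #, a pattern that matches the full ## is the safer choice.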
Upvotes: 1