Reputation: 37085
I have a character vector that contains text similar to the following:
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
I would like to remove everything up to and including the last . in each element:
> "xYz" "ge" "qrstu"
However, the grep function seems to be treating . as a letter:
pattern <- "([A-Z]|[a-z])+$"
grep(pattern, text, value = T)
> "ABc.def.xYz" "ge" "lmo.qrstu"
The pattern works elsewhere, such as on regexpal.
How can I get grep to behave as expected?
Upvotes: 5
Views: 635
Reputation: 887431
grep is for finding a pattern. It returns the indices of the vector elements that match the pattern; if value = TRUE is specified, it returns the matching elements themselves. From the description, it seems that you want to remove a substring rather than return a subset of the initial vector.
If you need to remove the substring, you can use sub:
sub('.*\\.', '', text)
#[1] "xYz" "ge" "qrstu"
The first argument is the pattern, '.*\\.'. It matches zero or more characters (.*) followed by a literal dot (\\.). The \\ is needed to escape the . so that it is treated as that symbol instead of as "any character". Because .* is greedy, the match extends to the last . in the string. We replace the matched text with '' (the replacement argument) and thereby remove the substring. Elements that contain no dot, such as "ge", do not match and are returned unchanged.
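To make the "matches until the last ." point concrete, here is a minimal sketch contrasting the greedy pattern from this answer with a lazy variant. The lazy pattern and the perl = TRUE argument are additions for illustration only and are not part of the original answer.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# Greedy: '.*' consumes as much as it can, so the match ends at the LAST dot.
greedy <- sub(".*\\.", "", text)

# Lazy (illustrative addition): '.*?' consumes as little as it can,
# so the match ends at the FIRST dot. perl = TRUE selects the PCRE engine.
lazy <- sub(".*?\\.", "", text, perl = TRUE)

print(greedy)
print(lazy)
```

In both calls the element "ge" comes back untouched, because sub only changes elements in which the pattern matches.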
Upvotes: 7
Reputation: 11607
Your pattern does work; the problem is that grep does something different from what you think it does.
Let's first use your pattern with str_extract_all from the stringr package.
library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"
[[2]]
[1] "ge"
[[3]]
[1] "qrstu"
Notice that the results come out as you expected!
The problem you are having is that grep gives you the complete element that matches your regular expression, not only the matching part of the element. In the example below, grep returns the first element because it contains "a":
grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"
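The same behaviour can be seen with the pattern from the question. The sketch below uses the question's vector plus one extra element, "9", which is added here purely for illustration so that there is something the pattern does not match.

```r
# "9" is a hypothetical extra element, not part of the original question.
text <- c("ABc.def.xYz", "ge", "lmo.qrstu", "9")
pattern <- "([A-Z]|[a-z])+$"

idx  <- grep(pattern, text)                # positions of matching elements
vals <- grep(pattern, text, value = TRUE)  # the matching elements, in full
hits <- grepl(pattern, text)               # one TRUE/FALSE per element

print(idx)
print(vals)
print(hits)
```

None of these calls trims the strings; they only report which elements contain a match somewhere.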
Upvotes: 2
Reputation: 174756
You may try the str_extract function from the stringr package.
str_extract(text, "[^.]*$")
This matches the run of non-dot characters at the end of each string.
Upvotes: 4
Reputation: 61953
grep doesn't do any replacements. It searches for matches and returns the indices (or the values, if you specify value = TRUE) of the elements that match. The results you're getting just say that those elements meet your criteria at some point in the string. If you added something to your text vector that doesn't meet the criteria anywhere (for example "9" or "#$%23"), grep wouldn't return it.
If you want only the matched portion returned, you should look at the regmatches function. For your purposes, however, it seems like sub or gsub should do what you want.
gsub(".*\\.", "", text)
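For completeness, the regmatches route mentioned above might look like the following sketch. It pairs regmatches with regexpr and reuses the pattern from the question; the extra "9" element is a hypothetical addition to show what happens to non-matching input.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# regexpr finds the first match in each element;
# regmatches then pulls out just the matched text.
matched <- regmatches(text, regexpr(pattern, text))
print(matched)

# Elements with no match are dropped rather than returned as NA or "".
with_extra <- c(text, "9")  # "9" is an illustrative, non-matching element
matched_extra <- regmatches(with_extra, regexpr(pattern, with_extra))
print(length(matched_extra))
```

Because non-matching elements are dropped, the result can be shorter than the input, which is one reason sub or gsub may be the more convenient choice here.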
I would suggest reading the help page for regular expressions, ?regex. The Wikipedia page is a decent read as well, but note that R's regexes are a little different from some others: https://en.wikipedia.org/wiki/Regular_expression
Upvotes: 5