sdgfsdh

Reputation: 37085

Prevent grep in R from treating "." as a letter

I have a character vector that contains text similar to the following:

text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

I would like to remove everything up to and including the last .:

> "xYz" "ge" "qrstu"

However, the grep function seems to be treating . as a letter:

pattern <- "([A-Z]|[a-z])+$"

grep(pattern, text, value = TRUE)

> "ABc.def.xYz" "ge"          "lmo.qrstu" 

The pattern works elsewhere, such as on regexpal.

How can I get grep to behave as expected?

Upvotes: 5

Views: 635

Answers (4)

akrun

Reputation: 887431

grep is for finding a pattern. It returns the indices of the elements of the vector that match the pattern; if value = TRUE is specified, it returns the matching elements themselves. From your description, it seems that you want to remove a substring rather than return a subset of the initial vector.

If you need to remove the substring, you can use sub:

 sub('.*\\.', '', text)
 #[1] "xYz"   "ge"    "qrstu"

The first argument is the pattern, '.*\\.'. It matches zero or more characters (.*) followed by a literal dot (\\.). The \\ is needed to escape the . so that it is treated as a literal dot instead of as "any character". Because .* is greedy, the match extends up to the last . in the string. We replace the matched text with '' (the replacement argument), which removes it. Elements that contain no dot, such as "ge", are left unchanged.
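To make the role of the escape and of greediness concrete, here is a minimal sketch in base R. The variable names greedy, lazy and unescaped are only illustrative, and the lazy variant is shown purely as a contrast, not as something the question needs.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# Greedy: '.*' takes as much as it can, so the match ends at the LAST dot.
greedy <- sub(".*\\.", "", text)

# Lazy (contrast only): '.*?' stops at the FIRST dot instead.
lazy <- sub(".*?\\.", "", text, perl = TRUE)

# Unescaped: the final '.' matches any character, so the whole string
# is consumed and nothing is left.
unescaped <- sub(".*.", "", text)

print(greedy)     # "xYz"     "ge" "qrstu"
print(lazy)       # "def.xYz" "ge" "qrstu"
print(unescaped)  # ""        ""   ""
```

Writing the dot as a character class, ".*[.]", is an equivalent way to match it literally without the double backslash.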

Upvotes: 7

Carlos Cinelli

Reputation: 11607

Your pattern does work. The problem is that grep does something different from what you think it does.

Let's first use your pattern with str_extract_all from the package stringr.

library(stringr)
str_extract_all(text, pattern = "([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"

[[2]]
[1] "ge"

[[3]]
[1] "qrstu"

Notice that the results came out as you expected!

The problem you are having is that grep returns the complete element that matches your regular expression, not only the matching part of the element. In the example below, grep returns the first element in full because it contains "a":

grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"
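As a rough illustration of the same point, the sketch below compares what grep and grepl report for that vector and for the vector from the question. The names x, idx, hits, vals and extended are just placeholders for this example.

```r
x <- c("abcdef", "bcdf")

idx  <- grep("a", x)                # positions of matching elements
hits <- grepl("a", x)               # one TRUE/FALSE per element
vals <- grep("a", x, value = TRUE)  # the matching elements, whole

print(idx)   # 1
print(hits)  # TRUE FALSE
print(vals)  # "abcdef"

# With the question's pattern, every element ends in letters, so every
# element is returned in full; an element such as "9" is filtered out.
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"
extended <- grepl(pattern, c(text, "9"))
print(extended)  # TRUE TRUE TRUE FALSE
```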

Upvotes: 2

Avinash Raj

Reputation: 174756

You may try the str_extract function from the stringr package.

str_extract(text, "[^.]*$")

This matches the run of non-dot characters at the end of each string.

Upvotes: 4

Dason

Reputation: 61953

grep doesn't do any replacements. It searches for matches and returns the indices of the elements that match (or the elements themselves if you specify value = TRUE). The results you're getting just say that each of those elements meets your criteria somewhere in the string. If you added something to your text vector that doesn't meet the criteria anywhere (for example "9" or "#$%23"), grep would not return it.

If you want just the matched portion returned, look at the regmatches function. However, for your purposes it seems like sub or gsub should do what you want.

gsub(".*\\.", "", text)
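For the regmatches route, a minimal sketch using the pattern from the question might look like the following. The names m and matched are only illustrative, and this assumes every element has a match, since regmatches silently drops elements that have none.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# regexpr() locates the first match in each element;
# regmatches() then extracts only the matched substring.
m <- regexpr(pattern, text)
matched <- regmatches(text, m)

print(matched)  # "xYz" "ge" "qrstu"
```

Here the result agrees with the gsub call, but the two approaches differ when an element has no match: gsub returns it unchanged, while regmatches omits it from the result.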

I would suggest reading the help page for regexes, ?regex. The Wikipedia page is a decent read as well, but note that R's regexes are a little different from some others: https://en.wikipedia.org/wiki/Regular_expression

Upvotes: 5
