Reputation: 730
I'm trying to use regular expressions to subset out the URL happy_to-learn.com.
As I'm really new to regex, could someone explain why my code does not work?
library(stringr)
x <- c("happy_to-learn.com", "His_is-omitted.net")
str_subset(x, "^[a-zA-Z](\\_|\\-)*\\.com$")
My understanding is that the portion ^[a-zA-Z](\\_|\\-)*
means: "Start at a letter from a to z or A to Z; if the string contains either _ or -, subset out this portion with 0 or more matches."
However, is it possible to continue from this code by adding the back part of the value I wish to subset? That is, \\.com$
refers to all values that end with .com.
Is there something like "^[a-zA-Z](\\_|\\-)*...\\.com$"
in regex?
Upvotes: 2
Views: 45
Reputation: 76653
Why use an external package? grep can do it too.
grep("^[[:alpha:]_-]+.*\\.com$", x, value = TRUE)
#[1] "happy_to-learn.com"
Explanation.
"^" marks the beginning of the string.
"[:alpha:]" matches any alphabetic character, upper or lower case, in a portable way.
"^[[:alpha:]_-]+": the characters between [] are alternatives to match, repeated one or more times. Here they are an alphabetic character, the underscore _, or the minus sign -.
"^[[:alpha:]_-]+.*": the above followed by any character repeated zero or more times.
"^[[:alpha:]_-]+.*\\.com$": the above, ending with the string ".com". Here the dot is meant literally, not as a metacharacter, and therefore must be escaped.
Upvotes: 1
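One consequence of the ".*" step is worth checking: because it accepts any characters, the character class only constrains the start of the string. The sketch below compares the answer's pattern with a stricter variant that drops ".*". The stricter pattern and the extra test value "a1!.com" are assumptions added for illustration and are not part of the original answer.

```r
# "a1!.com" is a hypothetical extra value, added only to show the difference
x <- c("happy_to-learn.com", "His_is-omitted.net", "a1!.com")

# Pattern from the answer: ".*" lets any characters follow the first run
loose <- "^[[:alpha:]_-]+.*\\.com$"

# Stricter variant (an assumption): only letters, "_" and "-" before ".com"
strict <- "^[[:alpha:]_-]+\\.com$"

grepl(loose, x)   # TRUE FALSE TRUE: "a1!.com" is accepted
grepl(strict, x)  # TRUE FALSE FALSE: "a1!.com" is rejected
```

Which variant is appropriate depends on whether digits and other symbols should be allowed in the part before ".com".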
Reputation: 887891
We need to specify one or more letters with +, because the _ or - does not come immediately after the first letter.
str_subset(x, "^[a-zA-Z]+(\\_|\\-).*\\.com$")
#[1] "happy_to-learn.com"
Also, .* matches zero or more characters, since . can be any character. It covers everything up to the literal . and 'com' at the end ($) of the string.
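To see why the pattern from the question fails, it may help to run both patterns against a few short strings. This is a minimal sketch in base R. The extra values "a.com" and "a_-.com" are hypothetical test inputs, and perl = TRUE is an assumption made here so that the escaped \\_ and \\- are unambiguously treated as literals.

```r
# "a.com" and "a_-.com" are hypothetical inputs added for illustration
x <- c("happy_to-learn.com", "His_is-omitted.net", "a.com", "a_-.com")

# Question's pattern: exactly ONE letter, then zero or more "_" or "-", then ".com"
original <- "^[a-zA-Z](\\_|\\-)*\\.com$"

# This answer's pattern: one or more letters, one "_" or "-",
# then any characters, then ".com"
fixed <- "^[a-zA-Z]+(\\_|\\-).*\\.com$"

# The original pattern keeps only "a.com" and "a_-.com"
x[grepl(original, x, perl = TRUE)]

# The fixed pattern keeps "happy_to-learn.com" and "a_-.com"
x[grepl(fixed, x, perl = TRUE)]
```

The original pattern rejects "happy_to-learn.com" because after the single letter "h" it only allows "_" or "-" before ".com".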
Upvotes: 2