Javier

Reputation: 730

Subsetting a value based on partial pattern

I'm trying to use a regular expression to subset out the URL happy_to-learn.com.

As I'm really new to regex, could someone explain why my code does not work?

library(stringr)
x <- c("happy_to-learn.com", "His_is-omitted.net")
str_subset(x, "^[a-zA-Z](\\_|\\-)*\\.com$")

As I understand it, the portion ^[a-zA-Z](\\_|\\-)* means: "start with a letter from a to z or A to Z; if it is followed by either _ or -, match that part zero or more times."

However, is it possible to continue from this code by adding the last part of the value I wish to subset? That is, \\.com$ should match all values that end with .com.

Is there something like "^[a-zA-Z](\\_|\\-)*...\\.com$" in regex?

Upvotes: 2

Views: 45

Answers (2)

Rui Barradas

Reputation: 76653

Why use an external package? grep can do it too.

grep("^[[:alpha:]_-]+.*\\.com$", x, value = TRUE)
#[1] "happy_to-learn.com"

Explanation.

  1. "^" marks the beginning of the string.
  2. "[:alpha:]" matches any alphabetic character, upper or lower case, in a portable way.
  3. "^[[:alpha:]_-]+" The brackets [] list alternative characters, any of which may match, repeated one or more times: a letter, the underscore _, or the minus sign -.
  4. "^[[:alpha:]_-]+.*" The above followed by any character repeated zero or more times.
  5. "^[[:alpha:]_-]+.*\\.com$" The string must end with ".com"; here the dot is meant literally rather than as a metacharacter, so it must be escaped.
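The steps above can be checked piece by piece in base R. This is only a small sketch: the variable names (prefix, matched, loose, strict) and the extra test string "happy_to-learnXcom" are made up for illustration and are not part of the question.

```r
x <- c("happy_to-learn.com", "His_is-omitted.net")

# Step 3: the leading run of letters, underscores and hyphens.
# The run stops at the first character outside the class (the dot).
prefix <- regmatches(x, regexpr("^[[:alpha:]_-]+", x))
print(prefix)

# Step 5: the full pattern keeps only the string ending in a literal ".com".
matched <- grepl("^[[:alpha:]_-]+.*\\.com$", x)
print(matched)

# Why the dot is escaped: unescaped, "." matches any character,
# so a string with no dot before "com" would slip through.
loose  <- grepl("^[[:alpha:]_-]+.*.com$",   "happy_to-learnXcom")
strict <- grepl("^[[:alpha:]_-]+.*\\.com$", "happy_to-learnXcom")
print(c(loose = loose, strict = strict))
```

Since the character class already covers _ and -, the ".*" in the middle is only needed when other characters (digits, for instance) can appear before ".com".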

Upvotes: 1

akrun

Reputation: 887891

We need to specify one or more letters with +, as the _ or - does not come right after the first letter.

str_subset(x, "^[a-zA-Z]+(\\_|\\-).*\\.com$")
#[1] "happy_to-learn.com"

Also, .* matches zero or more characters, as . can be any character, up to the literal . and 'com' at the end ($) of the string.
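To see the difference between the two patterns, here is a rough comparison. It uses base grepl with perl = TRUE instead of str_subset so that it runs without stringr; that swap is my assumption, as are the names original and fixed and the extra test strings.

```r
x <- c("happy_to-learn.com", "His_is-omitted.net")

original <- "^[a-zA-Z](\\_|\\-)*\\.com$"    # pattern from the question
fixed    <- "^[a-zA-Z]+(\\_|\\-).*\\.com$"  # pattern from this answer

# The original allows exactly one letter, then only _ or -, then ".com",
# so neither element of x matches.
orig_on_x <- grepl(original, x, perl = TRUE)
print(orig_on_x)

# It does match strings of the shape it actually describes.
orig_on_short <- grepl(original, c("a.com", "a_-.com"), perl = TRUE)
print(orig_on_short)

# The fixed pattern: letters, one _ or -, anything, then ".com".
fixed_on_x <- grepl(fixed, x, perl = TRUE)
print(fixed_on_x)

# Caveat: the fixed pattern requires at least one _ or -.
fixed_no_sep <- grepl(fixed, "happy.com", perl = TRUE)
print(fixed_no_sep)
```

If names without any _ or - should also be kept, a character class such as "^[a-zA-Z_-]+\\.com$" may be a better fit.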

Upvotes: 2
