Regex to extract specific domain names in R

Question

I have thousands of URLs and I want to extract domain names. I am using the following regex to do this: http://|https://|www\.

This manages to extract domains like so:

elpais.com
veren.elpais.com
canaris7.es
vertele.eldiario.es
eldiario.es

The problem is that I want to only extract the domain -- that is, both vertele.eldiario.es and eldiario.es should give me eldiario.es.

I have used urltools as well, but it doesn't seem to be doing the job. I need to extract the domain because I need to have a proper count of the specific domains in all URLs. I am interested in a regex that can extract domains ending in either .com or .es.

prosoitos · Accepted Answer

The regular expression .*\.(.*\.(com|es)), used with sub to replace the whole match with its first capture group (the outer parentheses), will do it. Note that inside an R string every backslash has to be doubled.

url <-  c(
  "http://www.elpais.com",
  "http://www.veren.elpais.com",
  "http://www.canaris7.es",
  "http://www.vertele.eldiario.es",
  "http://www.eldiario.es"
)

sub(".*\\.(.*\\.(com|es))", "\\1", url)

[1] "elpais.com"  "elpais.com"  "canaris7.es" "eldiario.es" "eldiario.es"
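
One caveat with the pattern above: it needs at least two dots in the URL, so an address without www. or another subdomain, such as http://elpais.com, comes back unchanged. Below is a minimal sketch of an alternative, assuming it is acceptable to strip the scheme and path first and then match from the end of the host name. The helper name extract_domain is hypothetical and not part of the original answer. Unlike the pattern above, it returns NA for hosts that do not end in .com or .es.

```r
# Hypothetical helper (not from the original answer): return "name.tld"
# for hosts whose last label is com or es, and NA otherwise.
extract_domain <- function(url) {
  # Drop the scheme, then everything from the first "/", ":", "?" or "#" on
  host <- sub("^[a-zA-Z]+://", "", url)
  host <- sub("[/:?#].*$", "", host)
  # Match the last label before a final .com or .es
  m <- regexpr("[^.]+\\.(com|es)$", host)
  hit <- as.vector(m) > 0
  out <- rep(NA_character_, length(url))
  # regmatches() returns only the matched substrings, in order
  out[hit] <- regmatches(host, m)
  out
}

extract_domain(c(
  "http://elpais.com",
  "https://vertele.eldiario.es/path",
  "http://www.google.com.br"
))
# "elpais.com" "eldiario.es" NA
```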

Edit following the comment from @Corion to another answer:

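If you are concerned about URLs with more complex suffixes (such as .es.hk or .com.br), you can append .* to the pattern so that anything after the .com or .es is dropped as well: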

.*\.(.*\.(com|es)).*

url <-  c(
  "http://www.elpais.com",
  "http://www.veren.elpais.com",
  "http://www.canaris7.es",
  "http://www.vertele.eldiario.es",
  "http://www.eldiario.es",
  "http://www.google.es.hk",
  "http://www.google.com.br"
)

sub(".*\\.(.*\\.(com|es)).*", "\\1", url)

[1] "elpais.com"  "elpais.com"  "canaris7.es" "eldiario.es" "eldiario.es"
[6] "google.es"   "google.com"
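
Since the goal in the question is a count per domain, the extracted vector can be passed to table(). This is a minimal sketch using the same example URLs; the variable names urls, domain and counts are only illustrative.

```r
urls <- c(
  "http://www.elpais.com",
  "http://www.veren.elpais.com",
  "http://www.canaris7.es",
  "http://www.vertele.eldiario.es",
  "http://www.eldiario.es",
  "http://www.google.es.hk",
  "http://www.google.com.br"
)

# Reduce each URL to its domain, then tabulate and sort by frequency
domain <- sub(".*\\.(.*\\.(com|es)).*", "\\1", urls)
counts <- sort(table(domain), decreasing = TRUE)
print(counts)
```

With these seven URLs, elpais.com and eldiario.es are each counted twice and the remaining domains once.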
