mundos

Reputation: 459

Regex to extract specific domain names in R

I have thousands of URLs and I want to extract their domain names. I am currently stripping the scheme and the www. prefix with the following regex: http://|https://|www\\.

This leaves me with hostnames like these:

elpais.com
veren.elpais.com
canaris7.es
vertele.eldiario.es
eldiario.es

The problem is that I only want the domain itself, without any subdomain. That is, both vertele.eldiario.es and eldiario.es should give me eldiario.es.

I have tried urltools as well, but it doesn't seem to do the job. I need the bare domain so that I can get a proper count of each domain across all URLs. I am interested in a regex that can extract domains ending in either .com or .es.

Upvotes: 0

Views: 527

Answers (2)

prosoitos

Reputation: 7417

The regular expression .*\\.(.*\\.(com|es)), used with sub and a replacement of \\1 to keep only the first capture group (the outer pair of parentheses), will do it.

url <-  c(
  "http://www.elpais.com",
  "http://www.veren.elpais.com",
  "http://www.canaris7.es",
  "http://www.vertele.eldiario.es",
  "http://www.eldiario.es"
)

sub(".*\\.(.*\\.(com|es))", "\\1", url)

[1] "elpais.com"  "elpais.com"  "canaris7.es" "eldiario.es" "eldiario.es"

Edit following the comment from @Corion to another answer:

If you are concerned about URLs with more complex suffixes (such as .es.hk or .com.br), you can append .* to the pattern so that anything after the .com or .es is discarded:

.*\\.(.*\\.(com|es)).*

url <-  c(
  "http://www.elpais.com",
  "http://www.veren.elpais.com",
  "http://www.canaris7.es",
  "http://www.vertele.eldiario.es",
  "http://www.eldiario.es",
  "http://www.google.es.hk",
  "http://www.google.com.br"
)

sub(".*\\.(.*\\.(com|es)).*", "\\1", url)

[1] "elpais.com"  "elpais.com"  "canaris7.es" "eldiario.es" "eldiario.es"
[6] "google.es"   "google.com"
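
One caveat: both patterns above require a dot somewhere before the captured domain, so a URL with no www. or subdomain prefix (such as http://eldiario.es) does not match, and sub returns the input unchanged. Below is a minimal sketch of an anchored variant that makes the scheme and the prefix optional. It assumes the URLs carry no path or port and end in .com or .es; the helper name extract_domain is hypothetical and not part of the original answer.

```r
# Hypothetical helper: keep only the last label plus a .com/.es suffix.
# Group 1: optional scheme, group 2: optional subdomain prefix,
# group 3: the domain we want to keep.
extract_domain <- function(url) {
  sub("^(https?://)?(.*\\.)?([^./]+\\.(com|es))$", "\\3", url)
}

url <- c(
  "http://eldiario.es",
  "http://www.vertele.eldiario.es",
  "https://elpais.com"
)

extract_domain(url)
# [1] "eldiario.es" "eldiario.es" "elpais.com"
```

URLs that do not fit these assumptions (for example, ones with a trailing path) are returned unchanged, so it may be worth checking the result for leftovers that still contain "/".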

Upvotes: 2

G5W

Reputation: 37661

I think that you just want the last two dot-separated components of the hostname. You can get them with sub and a regular expression that captures the final word.word pair.

URLs = c(
  "http://www.elpais.com",
  "http://veren.elpais.com",
  "http://www.canaris7.es",
  "http://vertele.eldiario.es",
  "http://eldiario.es"
)

sub(".*\\b(\\w+\\.\\w+)", "\\1", URLs)

[1] "elpais.com"  "elpais.com"  "canaris7.es" "eldiario.es" "eldiario.es"
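
Since the goal in the question is a count per domain, the extracted vector can be passed straight to table. The following is only a sketch: the variable names domains and domain_counts are illustrative, and perl = TRUE is an addition here (not in the call above) so that \\b and \\w follow PCRE semantics.

```r
urls <- c(
  "http://www.elpais.com",
  "http://veren.elpais.com",
  "http://www.canaris7.es",
  "http://vertele.eldiario.es",
  "http://eldiario.es"
)

# Keep the last two dot-separated components of each URL.
domains <- sub(".*\\b(\\w+\\.\\w+)", "\\1", urls, perl = TRUE)

# Tabulate how often each domain occurs, most frequent first.
domain_counts <- sort(table(domains), decreasing = TRUE)
print(domain_counts)
```

With these five URLs, elpais.com and eldiario.es are each counted twice and canaris7.es once.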

Upvotes: 1
