Reputation: 459
I have thousands of URLs and I want to extract domain names. I am using the following regex to do this: http://|https://|www\\.
This manages to extract domains like so:
elpais.com
veren.elpais.com
canaris7.es
vertele.eldiario.es
eldiario.es
The problem is that I want to only extract the domain -- that is, both vertele.eldiario.es and eldiario.es should give me eldiario.es.
I have tried urltools as well, but it doesn't seem to do the job. I need to extract the domain because I need a proper count of the specific domains across all URLs. I am interested in a regex that can extract domains with TLDs ending in both .com and .es.
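For reference, the prefix-stripping step described above might look roughly like the sketch below. The question does not show the actual call, so the use of gsub, the variable names, and the sample URLs are all assumptions made for illustration.

```r
# Hypothetical sample; the real URLs are not shown in the question.
urls <- c(
  "http://www.elpais.com",
  "https://veren.elpais.com",
  "http://vertele.eldiario.es"
)

# Remove the scheme and a leading "www." with the regex from the question.
hosts <- gsub("http://|https://|www\\.", "", urls)
print(hosts)
```

This only removes the prefix, so subdomains such as veren. and vertele. are left in place, which is the problem described above.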
Upvotes: 0
Views: 527
Reputation: 7417
The regular expression .*\\.(.*\\.(com|es)), used with sub to return the first capture group (the part between the outer parentheses), will do it.
url <- c(
"http://www.elpais.com",
"http://www.veren.elpais.com",
"http://www.canaris7.es",
"http://www.vertele.eldiario.es",
"http://www.eldiario.es"
)
sub(".*\\.(.*\\.(com|es))", "\\1", url)
[1] "elpais.com" "elpais.com" "canaris7.es" "eldiario.es" "eldiario.es"
Edit following the comment from @Corion to another answer:
If you are concerned about URLs with more complex suffixes (for example, a country code after .com or .es), you can use:
.*\\.(.*\\.(com|es)).*
url <- c(
"http://www.elpais.com",
"http://www.veren.elpais.com",
"http://www.canaris7.es",
"http://www.vertele.eldiario.es",
"http://www.eldiario.es",
"http://www.google.es.hk",
"http://www.google.com.br"
)
sub(".*\\.(.*\\.(com|es)).*", "\\1", url)
[1] "elpais.com" "elpais.com" "canaris7.es" "eldiario.es" "eldiario.es"
[6] "google.es" "google.com"
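Since the goal in the question is a count per domain, the extracted values can be tabulated. The following is only a minimal sketch under that assumption: it reuses the pattern above with base R's table, and the names domain and counts are illustrative.

```r
# Sample URLs taken from the answer above.
url <- c(
  "http://www.elpais.com",
  "http://www.veren.elpais.com",
  "http://www.canaris7.es",
  "http://www.vertele.eldiario.es",
  "http://www.eldiario.es"
)

# Reduce each URL to its domain, then count how often each one occurs.
domain <- sub(".*\\.(.*\\.(com|es)).*", "\\1", url)
counts <- sort(table(domain), decreasing = TRUE)
print(counts)
```

Note that this pattern needs a dot in front of the domain, so a URL such as http://eldiario.es (no www. or subdomain) would be returned unchanged and counted under its full string.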
Upvotes: 2
Reputation: 37661
I think that you just want the last two components of the URL. You can get that from sub and a regular expression.
URLs = c("http://www.elpais.com",
"http://veren.elpais.com",
"http://www.canaris7.es",
"http://vertele.eldiario.es",
"http://eldiario.es")
sub(".*\\b(\\w+\\.\\w+)", "\\1", URLs)
[1] "elpais.com" "elpais.com" "canaris7.es" "eldiario.es" "eldiario.es"
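The pattern above assumes each URL ends with the host name. If the real URLs carry a path or query string, one possible approach is to strip those first and then keep the last two components. This is a sketch only: the URLs below are hypothetical, and the two extra sub calls are an addition, not part of the answer.

```r
# Hypothetical URLs with paths and queries; not from the original question.
URLs <- c(
  "http://vertele.eldiario.es/noticias/index.html",
  "https://www.elpais.com/economia?page=2",
  "http://eldiario.es"
)

# Drop the scheme, then everything from the first "/", "?", "#" or ":" onward.
hosts <- sub("^[a-zA-Z]+://", "", URLs)
hosts <- sub("[/?#:].*$", "", hosts)

# Keep the last two dot-separated components, as in the answer above.
domains <- sub(".*\\b(\\w+\\.\\w+)$", "\\1", hosts)
print(domains)
```

Because \w does not match a hyphen, host names containing one (for example my-site.es) would need a wider character class.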
Upvotes: 1