Bartha

Reputation: 115

Capturing specific part of domain name in R using regex

I am trying to capture domain names from a long string in R. The domain names are as follows.

11.22.44.55.url.com.localhost

The regex I am using is as follows:

(gsub("(.*)\\.([^.]*url[^.]*)\\.(.*)","\\2","11.22.44.55.test.url.com.localhost",ignore.case=T)[1]) 

When I test it, I get the right answer, which is

url.com

But when I run it as a job on a large dataset (using R and Hadoop), the result ends up being this:

11.22.44.55.url

And sometimes when the domain is

11.22.44.55.test.url.com.localhost

but I never get

url.com

I am not sure how this could happen. When I test it on individual strings it works fine, but when I run it on my actual dataset it fails. Am I missing a corner case that is causing the problem? Additional information on the dataset: each of these domain addresses is an element in a list, stored as a string; I extract it and run gsub on it.

Upvotes: 0

Views: 139

Answers (3)

Sven Hohenstein

Reputation: 81733

This solution is based on using sub twice. First, ".localhost" is removed from the string. Then, the domain is extracted:

# example strings
test <- c("11.22.44.55.url.com.localhost", 
          "11.22.44.55.test.url.com.localhost",
          "11.22.44.55.foo.bar.localhost")


sub(".*\\.(\\w+\\.\\w+)$", "\\1", sub("\\.localhost", "", test))
# [1] "url.com" "url.com" "foo.bar"

This solution also works for strings ending with "url.com" (without ".localhost").
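
The two-step idea can also be wrapped in a small function. This is only a sketch: the name extract_domain is hypothetical, and anchoring the suffix with "$" is an added assumption so that ".localhost" is removed only when it sits at the very end of the string.

```r
# Hypothetical helper (not part of the original answer):
# 1. drop a trailing ".localhost" if present,
# 2. keep the last two dot-separated labels.
extract_domain <- function(x) {
  stripped <- sub("\\.localhost$", "", x)
  sub(".*\\.(\\w+\\.\\w+)$", "\\1", stripped)
}

test <- c("11.22.44.55.url.com.localhost",
          "11.22.44.55.test.url.com.localhost",
          "url.com")

result <- extract_domain(test)
print(result)
```

Strings that do not match the second pattern, such as a bare "url.com", are returned unchanged, because sub leaves non-matching input as it is.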

Upvotes: 1

alex keil

Reputation: 1031

I'm not 100% sure what you're going for with the match, but this will grab "url" plus the next word/numeric sequence after that. I think the "*" wildcard is too greedy, so I made use of the "+", which matches 1 or more characters, rather than 0 or more (like "*").


oobar = c(
  "11.22.44.55.url.com.localhost",
  "11.22.44.55.test.url.cog.localhost",
  "11.22.44.55.test.url.com.localhost"
)

f = function(url) gsub("(.+)[\\.](url[\\.]+[^\\.]+)[\\.](.+)", "\\2", url, ignore.case = TRUE)
f(oobar)
# [1] "url.com" "url.cog" "url.com"
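
One thing that may be worth ruling out on the large dataset (an assumption, since the failing inputs are not shown): sub and gsub return the input unchanged when the pattern does not match, so unexpected entries pass through silently. The sketch below extracts the match with regexpr and regmatches instead and marks non-matching entries as NA. The second input string and the simplified pattern are hypothetical, chosen only to illustrate a non-match.

```r
# Hypothetical inputs: the second one has nothing after "url",
# so the pattern below cannot match it.
x <- c("11.22.44.55.test.url.com.localhost",
       "11.22.44.55.url")

# "url", a literal dot, then one or more non-dot characters
pattern <- "url\\.[^.]+"

m <- regexpr(pattern, x, ignore.case = TRUE)

# Start with NA everywhere, then fill in only the entries that matched
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)
print(out)
```

Counting the NA values afterwards, for example with sum(is.na(out)), gives a quick check of how many entries in the dataset do not have the expected shape.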

Upvotes: 0

ndr

Reputation: 1437

Why not try something simpler: split on "." and pick the parts you want?

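# Split on literal dots, then pick the parts by position
x <- unlist(strsplit("11.22.44.55.test.url.com.localhost",
    split = ".", fixed = TRUE))
paste(x[6], x[7], sep = ".")  # positions 6 and 7 are "url" and "com" for this string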
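
Fixed positions only work when every string has the same number of parts. A possible variant, assuming the wanted parts are always the two labels just before the final one (for example ".localhost"), counts from the end instead. The function name second_level is hypothetical.

```r
# Hypothetical helper: take the two labels that precede the last label.
# Assumes every input ends in one trailing label such as "localhost".
second_level <- function(domain) {
  parts <- strsplit(domain, ".", fixed = TRUE)[[1]]
  n <- length(parts)
  if (n < 3) return(NA_character_)  # too short to have two labels plus a suffix
  paste(parts[n - 2], parts[n - 1], sep = ".")
}

domains <- c("11.22.44.55.url.com.localhost",
             "11.22.44.55.test.url.com.localhost")

res <- vapply(domains, second_level, character(1), USE.NAMES = FALSE)
print(res)
```

Because the index is relative to the end, the extra "test" label in the second string does not shift the result.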

Upvotes: 0
