Reputation: 115
I am trying to capture domain names from a long string in R. The domain names are as follows.
11.22.44.55.url.com.localhost
The regex I am using is as follows:
(gsub("(.*)\\.([^.]*url[^.]*)\\.(.*)","\\2","11.22.44.55.test.url.com.localhost",ignore.case=T)[1])
When I test it, I get the right answer, which is
url.com
But when I run it as a job on a large dataset (I run this using R and Hadoop), the result ends up being this:
11.22.44.55.url
And sometimes when the domain is
11.22.44.55.test.url.com.localhost
but I never get
url.com
I am not sure how this could happen. When I test it on individual strings it works, but when I run it on my actual dataset it fails. Am I missing a corner case that is causing the problem? Additional information on the dataset: each of these domain addresses is an element in a list, stored as a string. I extract each one and run the gsub on it.
Upvotes: 0
Views: 139
Reputation: 81733
This solution is based on using sub twice. First, ".localhost" is removed from the string. Then, the URL is extracted:
# example strings
test <- c("11.22.44.55.url.com.localhost",
"11.22.44.55.test.url.com.localhost",
"11.22.44.55.foo.bar.localhost")
sub(".*\\.(\\w+\\.\\w+)$", "\\1", sub("\\.localhost", "", test))
# [1] "url.com" "url.com" "foo.bar"
This solution also works for strings ending with "url.com" (without ".localhost").
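In case it helps, here is a minimal sketch of the same two-step idea wrapped in a function. The name extract_domain is made up for illustration, and anchoring ".localhost" with $ is my own assumption (so that only a trailing ".localhost" is dropped); it is not part of the code above.

```r
# Hypothetical helper: same two sub() calls as above, wrapped in a function.
extract_domain <- function(x) {
  # Step 1: drop a trailing ".localhost" if present ($ anchor is an assumption).
  stripped <- sub("\\.localhost$", "", x)
  # Step 2: keep only the last two dot-separated labels.
  sub(".*\\.(\\w+\\.\\w+)$", "\\1", stripped)
}

# The question says the addresses are elements of a list, so flatten first.
addresses <- list("11.22.44.55.url.com.localhost",
                  "11.22.44.55.test.url.com.localhost",
                  "11.22.44.55.foo.bar.localhost")
result <- extract_domain(unlist(addresses))
print(result)
```

Since sub is vectorised, the whole character vector is handled in one call rather than element by element.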
Upvotes: 1
Reputation: 1031
I'm not 100% sure what you're going for with the match, but this will grab "url" plus the next word/numeric sequence after that. I think the "*" wildcard is too greedy, so I made use of the "+", which matches 1 or more characters, rather than 0 or more (like "*").
>oobar = c(
>"11.22.44.55.url.com.localhost",
>"11.22.44.55.test.url.cog.localhost",
>"11.22.44.55.test.url.com.localhost"
>)
>f = function(url) (gsub("(.+)[\\.](url[\\.]+[^\\.]+)[\\.](.+)","\\2",url,ignore.case=TRUE))
>f(oobar)
[1] "url.com" "url.cog" "url.com"
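To illustrate the "0 or more" versus "1 or more" difference on its own, here is a small sketch. The input "baaa" is a made-up example, not from the question. Note that both quantifiers are greedy; what differs is the minimum number of repetitions they accept.

```r
# Made-up input for illustration only.
x <- "baaa"

# "a*" accepts zero repetitions, so it matches the empty string
# at the very start of x and the replacement is inserted there.
star <- sub("a*", "X", x)

# "a+" requires at least one "a", so the first match is the run "aaa".
plus <- sub("a+", "X", x)

print(star)
print(plus)
```

So whether "+" changes the result for a given pattern depends on whether an empty match was possible at that position in the first place.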
Upvotes: 0
Reputation: 1437
Why not try something simpler: split on "." and pick the parts you want?
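# split the address into its dot-separated parts
x <- unlist(strsplit("11.22.44.55.test.url.com.localhost",
                     split = ".", fixed = TRUE))
# parts 6 and 7 are "url" and "com" for this particular string
paste(x[6], x[7], sep = ".")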
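The fixed positions 6 and 7 only fit strings with exactly that number of parts. A possible variation, sketched under the assumption that you always want the last two labels before an optional trailing "localhost" (the helper name last_two_labels is hypothetical):

```r
# Hypothetical helper: split on "." and keep the last two labels,
# ignoring a trailing "localhost" label if there is one.
last_two_labels <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) {
    n <- length(p)
    # Drop only a trailing "localhost", not one in the middle.
    if (n > 0 && p[n] == "localhost") p <- p[-n]
    paste(utils::tail(p, 2), collapse = ".")
  }, character(1))
}

domains <- last_two_labels(c("11.22.44.55.url.com.localhost",
                             "11.22.44.55.test.url.com.localhost"))
print(domains)
```

This avoids counting positions by hand, at the cost of assuming the domain is always the final two labels.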
Upvotes: 0