Reputation: 33
I have a problem with a regex that has to capture a substring that it has already captured...
I have this regex:
(?<domain>\w+\.\w+)($|\/|\.)
And I want to capture every subdomain recursively. For example, in this string:
test1.test2.abc.def
This expression captures test1.test2 and abc.def
but I need to capture:
test1.test2
test2.abc
abc.def
Do you know if there is any option to do this recursively?
Thanks!
Upvotes: 3
Views: 164
Reputation: 627292
You may use a well-known technique to extract overlapping matches, but you can't rely on \b boundaries, as they match both between a non-word and a word char and between a word and a non-word char. You need unambiguous word boundaries for the left-hand and right-hand contexts.
Use
(?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))
See the regex demo. Details:
(?= - a positive lookahead that enables testing each location in the string and capture the part of string to the right of it
(?<!\w) - a left-hand side word boundary
(?<domain>\w+\.\w+) - Group "domain": 1+ word chars, . and 1+ word chars
(?!\w) - a right-hand side word boundary
) - end of the outer lookahead.
Another approach is to use dots as word delimiters. Then use
(?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))
See this regex demo. Adjust as you see fit.
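To make the two patterns above concrete, here is a minimal Ruby sketch (Ruby only because it is the language used elsewhere in this thread; the question does not name a regex flavor). The constant names and the overlapping_pairs helper are made up for illustration. Because the whole pattern sits inside a lookahead, every match is zero-width, so scan retries at the next position and the captures can overlap.

```ruby
# Both patterns wrap everything in a lookahead, so each match is zero-width
# and String#scan moves on one character at a time, allowing overlaps.
WORD_BOUNDED  = /(?=(?<!\w)(?<domain>\w+\.\w+)(?!\w))/
DOT_DELIMITED = /(?=(?<![^.])(?<domain>[^.]+\.[^.]+)(?![^.]))/

# Hypothetical helper: returns every captured "domain" group as a flat array.
def overlapping_pairs(text, pattern)
  # scan returns one array of captures per match; flatten to plain strings
  text.scan(pattern).flatten
end

puts overlapping_pairs("test1.test2.abc.def", WORD_BOUNDED)
puts overlapping_pairs("regex-tester.co.uk", DOT_DELIMITED)
```

The second call shows where the two patterns differ: the dot-delimited version keeps the hyphenated label whole, while the \w-based one would start matching at "tester".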
Upvotes: 0
Reputation: 165386
You can use a positive look ahead to capture the next group.
/(\w+)\.(?=(\w+))/g
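As a rough sketch of how that pattern could be used from Ruby: Ruby has no /g flag, so String#scan plays that role here. The first group consumes a label and its dot, and the lookahead captures the following label without consuming it, so the next match can start there. The PAIR constant and the pairs variable are names I made up.

```ruby
# Group 1 is consumed together with the dot; group 2 is only looked at,
# so the following match can begin with that same label.
PAIR = /(\w+)\.(?=(\w+))/

pairs = "test1.test2.abc.def".scan(PAIR).map { |left, right| "#{left}.#{right}" }
puts pairs
```

Note that each pair has to be reassembled from two groups, since no single group holds the whole "left.right" string.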
Edit: JvdV's regex is more correct.
Note that \w+ will fail to match domains like regex-tester.com and will match the invalid regex_tester.com. [a-zA-Z0-9-]+ is closer to correct. See this answer for a complete regex.
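A quick way to see the difference between the two character classes, as a sketch only: the \A and \z anchors and the constant names are my own additions so that the whole string is checked, and neither pattern is a full hostname validator.

```ruby
# Whole-string checks for a two-label name; the anchors are an assumption
# added for this comparison, not part of the patterns discussed above.
WORD_LABELS = /\A\w+\.\w+\z/
HOST_LABELS = /\A[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+\z/

%w[regex-tester.com regex_tester.com].each do |name|
  puts "#{name}: \\w+ => #{name.match?(WORD_LABELS)}, [a-zA-Z0-9-]+ => #{name.match?(HOST_LABELS)}"
end
```

The hyphenated name only passes the second pattern, and the underscored name only passes the first.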
It's simpler and more robust to do this by splitting on . and iterating through the pieces in pairs. For example, in Ruby...
"test1.test2.abc.def".split(".").each_cons(2) { |a|
puts a.join(".")
}
test1.test2
test2.abc
abc.def
Upvotes: 1