Reputation: 599
This regex matches the whole line. How can I grep only the domain, without the extra characters?
echo "AAAA cccc.google.com BBBB" | grep -oE "[^\.\n]*((\.[^\.\n]*){2}$)" --color=always
I want cccc.google.com to be matched, but not AAAA cccc.google.com BBBB. Adding \b doesn't work.
echo "AAAA cccc.google.com BBBB" | grep -oE "\b[^\.\n]*((\.[^\.\n]*){2}\b$)\b" --color=always
Edit: I forgot to say that I need this for grepping third-level and fourth-level domains. Here's what I meant:
g.google.com: this is a third-level domain.
a.b.google.com: this is a fourth-level domain.
My regex above was matching the third-level domain, but it also matched some other characters, which is why I asked the question.
Let's say I have AAAA a.b.c.d.e.g.google.com BBBB. Then {3} should give me g.google.com, and {4} or {3,4} should give me e.g.google.com, while omitting the unwanted characters. My regex does exactly that, but it also picks up extra characters!
So, using this regex (from an answer, modified):
echo "AAAA d.cccc.google.com BBB" | grep -oE '\w+(\.\w+){2}'
drops the .com part, which my regex doesn't (but mine prints extra characters :( ). Could you please modify it to work in this case?
Upvotes: 1
Views: 692
Reputation: 5625
It looks like the OP wants an interactive regex (clarified in the comments) that can extract n domain levels, where n is variable.
Something like this should work: (?:\w+(?:\.|\b)){4}(?=\.\w+(?: |$))\.\w+
With {2}:
$ echo "AAAA a.b.c.d.e.g.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){2}(?=\.\w+(?: |$))\.\w+"
g.google.com
This captures 2 labels, not counting the top-level domain (i.e. com).
With {3}:
$ echo "AAAA a.b.c.d.e.g.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){3}(?=\.\w+(?: |$))\.\w+"
e.g.google.com
This captures 3 labels, not counting the top-level domain (i.e. com), and so on.
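(?:\w+(?:\.|\b)){3} <- This is the same as in my original answer: it captures word characters followed by a . (or a word boundary), exactly 3 times.
(?=\.\w+(?: |$))\.\w+ <- This acts as the stopping point for the previous part. The lookahead only succeeds where a . and word characters are followed by a space or the end of the line, so it marks the start of the top-level domain, which is then captured.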
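Since only the repetition count changes between these commands, the pattern can be wrapped in a small shell function. This is just a sketch: the name last_labels and its argument are made up for illustration, and it assumes GNU grep built with PCRE support (-P).

```shell
# Hypothetical helper (not part of grep): print the last n labels plus the
# top-level domain of each domain-like token read from stdin.
# Assumes GNU grep with -P (PCRE) support.
last_labels() {
    n="$1"
    # \$ keeps the end-of-line anchor literal inside double quotes
    grep -oP "(?:\w+(?:\.|\b)){${n}}(?=\.\w+(?: |\$))\.\w+"
}

echo "AAAA a.b.c.d.e.g.google.com BBB" | last_labels 2   # g.google.com
echo "AAAA a.b.c.d.e.g.google.com BBB" | last_labels 3   # e.g.google.com
```

The count is interpolated straight into the {n} quantifier, so a range such as 3,4 would presumably work as an argument too.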
That regex seems completely wrong. If you want to match only names like cccc.google.com and www.google.com but not google.com, you should use: (?:\w+(?:\.|\b)){3}
The primary part is \w+(?:\.|\b): this matches word characters that are immediately followed by a . or a word boundary (e.g. the position just before a space).
It is wrapped in (?:...){3}, which requires such a group to occur 3 times in a row.
To also grep fourth-level domains, just change the {3} to {3,4}:
(?:\w+(?:\.|\b)){3,4}
This is how you should do it with grep:
$ echo "AAAA cccc.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){3,4}"
cccc.google.com
And with d.cccc.google.com
$ echo "AAAA d.cccc.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){3,4}"
d.cccc.google.com
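As a quick sanity check, the same pattern can be run against a two-label name to confirm that it is skipped. A rough sketch, assuming GNU grep with -P; the variable names are only illustrative:

```shell
# Assumes GNU grep with -P (PCRE) support.
pattern='(?:\w+(?:\.|\b)){3,4}'

for host in google.com cccc.google.com d.cccc.google.com; do
    # grep exits non-zero when nothing matches, so fall back to a placeholder
    match=$(echo "AAAA $host BBB" | grep -oP "$pattern") || match="(no match)"
    echo "$host -> $match"
done
```

Only google.com should report no match, because it has fewer than three labels.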
Upvotes: 5
Reputation: 4564
Just echo "AAAA cccc.google.com BBBB" | grep -oE '\w+(\.\w+)+' --color=always seems to work. \w is more or less what should be expected in domain names.
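The "more or less" matters for names containing a hyphen, which \w does not match. A small sketch of the difference, assuming GNU grep (\w in -E mode is a GNU extension) and using a made-up host name:

```shell
# Sketch only: my-site.example.com is a made-up host name.
line="AAAA my-site.example.com BBBB"

# \w does not match '-', so the match starts after the hyphen
echo "$line" | grep -oE '\w+(\.\w+)+'

# A bracket expression that also allows '-' keeps the whole name
echo "$line" | grep -oE '[[:alnum:]-]+(\.[[:alnum:]-]+)+'
```

The first command prints site.example.com and the second prints my-site.example.com. Unlike \w, the bracket expression does not match underscores.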
Upvotes: 2