Machinexa

Reputation: 599

How do I improve my regex to grep a third-level domain without the extra characters at the end?

This regex matches the whole line. How can I grep only the domain, without the extra characters?

echo "AAAA  cccc.google.com BBBB" | grep -oE "[^\.\n]*((\.[^\.\n]*){2}$)"  --color=always 

I want cccc.google.com to be matched, not AAAA cccc.google.com BBBB. Adding \b doesn't work:
echo "AAAA cccc.google.com BBBB" | grep -oE "\b[^\.\n]*((\.[^\.\n]*){2}\b$)\b" --color=always

Edit: I forgot to say that I need this for grepping third-level and fourth-level domains. Here's what I meant:

My regex above was matching the third-level domain, but it also matched other characters, so I asked this question. Let's say I have AAAA a.b.c.d.e.g.google.com BBBB: then {3} should give me g.google.com, and {4} or {3,4} should give me e.g.google.com, while omitting the unwanted characters. My regex does exactly that, but it includes extra characters!

So, using this regex (from an answer, modified):
echo "AAAA d.cccc.google.com BBB" | grep -oE '\w+(\.\w+){2}'
omits the .com part, which my regex doesn't (but mine prints extra characters). Could you please modify it to work in this case?

Upvotes: 1

Views: 692

Answers (2)

Chase

Reputation: 5625

It looks like the OP wants an adjustable regex (clarified in the comments) that can extract n domain levels, where n is variable.

Something like this should work: (?:\w+(?:\.|\b)){4}(?=\.\w+(?: |$))\.\w+

Check out the demo

Usage

  • With {2}

    $ echo "AAAA  a.b.c.d.e.g.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){2}(?=\.\w+(?: |$))\.\w+"
    g.google.com
    
    Matches the 2 labels before the top-level domain; the top-level domain itself (i.e. .com) is matched by the trailing \.\w+
  • With {3}

    $ echo "AAAA  a.b.c.d.e.g.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){3}(?=\.\w+(?: |$))\.\w+"
    e.g.google.com
    
    Matches the 3 labels before the top-level domain; the top-level domain itself (i.e. .com) is matched by the trailing \.\w+

...and so on

Explanation

(?:\w+(?:\.|\b)){3} <- This is the same as in my original answer: it matches a run of word characters followed by a . or a word boundary, exactly 3 times

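(?=\.\w+(?: |$))\.\w+ <- This acts as the stopping point for the previous part. The lookahead requires that what follows is a . plus word characters and then a space or the end of the line, i.e. the top-level domain; the trailing \.\w+ then consumes it.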
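
Since only the repetition count changes between the {2} and {3} examples, the pattern can be wrapped in a small shell function. This is a minimal sketch and not part of the original answer: the function name last_labels and its argument are hypothetical, and it assumes GNU grep built with -P (PCRE) support.

```shell
# last_labels N: hypothetical helper; reads text on stdin and prints the last
# N labels before the top-level domain, followed by the top-level domain.
last_labels() {
    # ${1} is spliced in as the repetition count; \$ keeps the shell from
    # treating the end-of-line anchor as the start of a variable.
    grep -oP "(?:\w+(?:\.|\b)){${1}}(?=\.\w+(?: |\$))\.\w+"
}

echo "AAAA  a.b.c.d.e.g.google.com BBB" | last_labels 2   # g.google.com
echo "AAAA  a.b.c.d.e.g.google.com BBB" | last_labels 3   # e.g.google.com
```

The pattern is double-quoted so the count can be interpolated; the backslash sequences \w, \. and \b pass through the shell unchanged.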

Original Answer

That regex seems completely wrong. If you only want to match names like cccc.google.com and www.google.com but not google.com, you should use: (?:\w+(?:\.|\b)){3}

Check out the demo

Explanation

The primary part is \w+(?:\.|\b) - this matches word characters that are immediately followed by a . or a word boundary (a zero-width position, e.g. just before a space)

This is wrapped in (?:...){3}, which requires three such groups in a row.

To also grep 4th-level domains, just change the {3} to {3,4}:

(?:\w+(?:\.|\b)){3,4}

Check out the demo

This is how to do it with grep (the -P flag enables Perl-compatible regexes, which the (?:...) syntax needs):

$ echo "AAAA  cccc.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){3,4}"
cccc.google.com

And with d.cccc.google.com

$ echo "AAAA  d.cccc.google.com BBB" | grep -oP "(?:\w+(?:\.|\b)){3,4}"
d.cccc.google.com
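
One thing to keep in mind with this unanchored version is that it matches from the left, so on a name with more labels than the count it returns the leading labels rather than the ones next to the top-level domain. The sketch below illustrates this on the longer example from the question; it assumes GNU grep with -P support.

```shell
# The pattern consumes three labels starting from the left, and \. is tried
# before \b, so each match also keeps its trailing dot.
echo "AAAA  a.b.c.d.e.g.google.com BBB" | grep -oP '(?:\w+(?:\.|\b)){3}'
# a.b.c.
# d.e.g.
```

The remaining google.com has only two labels, so it is not matched at all. Anchoring on the top-level domain with a lookahead avoids this.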

Upvotes: 5

Alexander Mashin

Reputation: 4564

Just echo "AAAA cccc.google.com BBBB" | grep -oE '\w+(\.\w+)+' --color=always seems to work. \w is more or less what should be expected in domain names, although it also allows _ and does not match the hyphen.
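
If only the last few labels are wanted, one option that stays within grep -E is to extract the whole name with the pattern above and trim it afterwards. This is a minimal sketch, assuming GNU grep (for \w in -E mode) and a POSIX awk; the function name tail_labels and its argument are hypothetical and not from this answer. Here the count includes the top-level domain.

```shell
# tail_labels N: hypothetical helper; prints the last N dot-separated labels
# (top-level domain included) of every dotted name found on stdin.
tail_labels() {
    grep -oE '\w+(\.\w+)+' | awk -F. -v n="$1" '{
        s = $NF                                   # start with the last label
        for (i = NF - 1; i > NF - n && i >= 1; i--)
            s = $i "." s                          # prepend earlier labels
        print s
    }'
}

echo "AAAA  a.b.c.d.e.g.google.com BBB" | tail_labels 3   # g.google.com
echo "AAAA  a.b.c.d.e.g.google.com BBB" | tail_labels 4   # e.g.google.com
```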

Upvotes: 2
