Ben

Reputation: 43

Regex to pull last 2 segments from FQDN

I'm trying to write a regex that pulls out the last 2 segments of an FQDN.

^.*\shostname=[\w-]+\.(?P<myfield>[^\t]+)

This regex works, but it only strips the first segment of the FQDN:

www.aaa.bbb.someurl.net --> aaa.bbb.someurl.net

But… I only want to keep the last 2 segments of any FQDN.

I need it to be --> someurl.net

Other restrictions:
The hostname field will always be at least 3 segments - don't know the max.

This is for Splunk so I can't use a script. I need it to be PCRE compatible regex.

Here is an example of data:

2021-07-20 18:19:14 reason=Not allowed to browse this category event_id=12345 protocol=HTTP action=Blocked transactionsize=16051 responsesize=789 requestsize=456 urlcategory=Blocked serverip=1.2.4.5 clienttranstime=0 requestmethod=GET refererURL=None useragent=Microsoft-Delivery location=Internal ClientIP=5.6.7.8 status=403 user=John url=dl.delivery.mp.microsoft.com/filestreamingservice/files/abcd-efgh-ijkl/pieceshash vendor=Zscaler hostname=dl.delivery.mp.microsoft.com

From this data I need the field "myfield" to be: microsoft.com.

Upvotes: 4

Views: 845

Answers (3)

Wiktor Stribiżew

Reputation: 627600

The original answer with a much simpler regex ((?:\s|^)hostname=(?:[^\s.]+\.)*(?P<myfield>[^\s.]+\.[^\s.]+)) that worked for OP is in the question history.
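
For reference, here is a minimal Python sketch of that simpler pattern. Python's re module is used only as a stand-in for a PCRE engine, and the helper name last_two_labels is made up for illustration. It returns the last two labels, which is what was asked for, but for a host under a two-part suffix such as .co.uk it yields only co.uk:

```python
import re

# The simpler pattern quoted above: greedily skip leading "label." groups,
# then capture the final two dot-separated labels.
SIMPLE = re.compile(
    r"(?:\s|^)hostname=(?:[^\s.]+\.)*(?P<myfield>[^\s.]+\.[^\s.]+)"
)

def last_two_labels(line):
    """Return the captured 'myfield' value, or None if nothing matches."""
    match = SIMPLE.search(line)
    return match.group("myfield") if match else None

print(last_two_labels("vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"))  # microsoft.com
print(last_two_labels("hostname=www.example.co.uk"))  # co.uk
```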


You can use

(?:\s|^)hostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))

Or, to match the last hostname=... value on a line:

^.*\shostname=(?:[^\s.]+\.)*?(?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S))

See the regex #1 demo and regex #2 demo. Details:

  • (?:\s|^) - either a whitespace or start of string
  • hostname= - a literal substring
  • (?:[^\s.]+\.)*? - zero or more (but as few as possible) occurrences of one or more chars other than whitespace and dot and then a dot
  • (?P<myfield>[^\s.]+\.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk|[^\s.]+)(?!\S)) - Group "myfield": one or more chars other than whitespace and dot, then a dot, then any second-level domain or any one or more chars other than whitespace and dot and then either a whitespace or end of string.
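
As a quick way to check the behaviour described in the list above, here is a small Python sketch that runs regex #1 against a shortened version of the sample log line and against a .co.uk host. The function name registered_domain and the example.co.uk input are assumptions for illustration; in Splunk the pattern itself would go into the field extraction.

```python
import re

# Regex #1 from the answer, split over several string literals for readability.
PATTERN = re.compile(
    r"(?:\s|^)hostname=(?:[^\s.]+\.)*?"
    r"(?P<myfield>[^\s.]+\."
    r"(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk"
    r"|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk"
    r"|[^\s.]+)(?!\S))"
)

def registered_domain(line):
    """Return the 'myfield' capture for the first hostname= value, or None."""
    match = PATTERN.search(line)
    return match.group("myfield") if match else None

sample = "status=403 user=John vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"
print(registered_domain(sample))                        # microsoft.com
print(registered_domain("hostname=www.example.co.uk"))  # example.co.uk
```

The lazy `*?` skips as few leading labels as possible, so the capture is the first candidate that is followed by whitespace or the end of the string.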

Note: the \.(?:(?:ac|co)\.uk|govt?\.uk|judiciary\.uk|l(?:ea|td)\.uk|m(?:e|il|od)\.uk|n(?:et|hs|ic)\.uk|orgn?\.uk|p(?:arliament|lc|olice)\.uk|(?:royal|sch)\.uk pattern part (built from a regex trie) matches this list:

.ac.uk
.co.uk
.gov.uk
.judiciary.uk
.ltd.uk
.me.uk
.mod.uk
.net.uk
.nhs.uk
.nic.uk
.org.uk
.parliament.uk
.plc.uk
.police.uk
.royal.uk
.sch.uk
.govt.uk
.orgn.uk
.lea.uk
.mil.uk

If you want to add more second-level domain names, add more to the list and use https://www.myregextester.com or a similar service to build the word list regex.
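
If an online service is not an option, the alternation can also be generated with a few lines of code. The sketch below is one possible approach, not the trie-based output mentioned above: it joins the labels into a flat alternation, which should match the same strings but is less compact. The names UK_LABELS and build_hostname_pattern are hypothetical.

```python
import re

# Second-level labels under .uk that the pattern in this answer covers.
UK_LABELS = [
    "ac", "co", "gov", "govt", "judiciary", "lea", "ltd", "me", "mil", "mod",
    "net", "nhs", "nic", "org", "orgn", "parliament", "plc", "police",
    "royal", "sch",
]

def build_hostname_pattern(labels, tld="uk"):
    """Compile the hostname= pattern with a flat alternation of suffixes."""
    alternation = "|".join(re.escape(label) for label in sorted(set(labels)))
    suffix = rf"(?:{alternation})\.{re.escape(tld)}"
    return re.compile(
        rf"(?:\s|^)hostname=(?:[^\s.]+\.)*?"
        rf"(?P<myfield>[^\s.]+\.(?:{suffix}|[^\s.]+)(?!\S))"
    )

uk_pattern = build_hostname_pattern(UK_LABELS)
print(uk_pattern.search("hostname=www.example.co.uk").group("myfield"))  # example.co.uk
```

Printing `uk_pattern.pattern` gives the regex text to paste into a PCRE-based tool.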

Upvotes: 2

Soc

Reputation: 7780

If you would like to account for country codes, I've previously answered this at: Get Domain Extension From Hostname

The regular expression would look something like (shortened version): \w+((\.[a-z]{2,3})(\.(uk|au))?)$

The full expression with all country codes: \w+((\.[a-z]{2,3})(\.(ad|ae|af|ag|ai|al|am|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bl|bm|bn|bo|bq|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cu|cv|cw|cx|cy|cz|de|dj|dk|dm|do|dz|ec|ee|eg|er|es|et|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mf|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|za|zm|zw))?)$
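
A rough Python sketch of how the shortened version behaves when applied to a bare hostname value (the pattern is anchored with $, so the hostname has to be at the end of the input). The function name domain_from_hostname and the example hostnames other than the one from the question are assumptions for illustration.

```python
import re

# Shortened expression from this answer, unchanged.
SHORT = re.compile(r"\w+((\.[a-z]{2,3})(\.(uk|au))?)$")

def domain_from_hostname(hostname):
    """Return the whole match (domain plus extension), or None."""
    match = SHORT.search(hostname)
    return match.group(0) if match else None

print(domain_from_hostname("dl.delivery.mp.microsoft.com"))  # microsoft.com
print(domain_from_hostname("www.example.co.uk"))             # example.co.uk
# As written, [a-z]{2,3} does not cover extensions longer than 3 letters.
print(domain_from_hostname("www.example.info"))              # None
```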

Upvotes: -1

The fourth bird

Reputation: 163632

You could match all the non-whitespace chars that follow hostname= and then use a capture group to capture the last two parts, separated by a single dot.

^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)
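  • ^.*\shostname= Match from the start of the line up to the last hostname= that is preceded by a whitespace char
  • (?:\S+\.)? Optionally match the leading segments: non-whitespace chars followed by a dot
  • ( Capture group 1
    • [^\s.]+\.[^\s.]+ Match 2 non-dot parts with a . in between
  • ) Close group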

Regex demo
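
A short Python sketch of the same pattern, to show what ends up in capture group 1. The helper name capture_last_two and the second test line are made up for illustration. Note that the pattern requires a whitespace char before hostname=, so a line that starts with hostname= does not match.

```python
import re

# Pattern from this answer, unchanged; the value is in capture group 1.
LAST_TWO = re.compile(r"^.*\shostname=(?:\S+\.)?([^\s.]+\.[^\s.]+)")

def capture_last_two(line):
    """Return capture group 1, or None if the line does not match."""
    match = LAST_TWO.search(line)
    return match.group(1) if match else None

print(capture_last_two("vendor=Zscaler hostname=dl.delivery.mp.microsoft.com"))  # microsoft.com
print(capture_last_two("a=1 hostname=www.aaa.bbb.someurl.net b=2"))              # someurl.net
print(capture_last_two("hostname=a.b.c"))                                        # None
```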

Upvotes: 1
