Turtle Head

Reputation: 51

Regex Help required for User-Agent Matching

I used an online regex learning site (regexr) and created something that works, but given my very limited experience with writing regexes, I could do with some help/advice.

The IIS10 logs contain fields for time, date... but I am only interested in the cs(User-Agent) field.

My Regex:

(scan\-\d+)(?:\w)+\.shadowserver\.org

which matches these:

scan-02.shadowserver.org
scan-15n.shadowserver.org
scan-42o.shadowserver.org
scan-42j.shadowserver.org
scan-42b.shadowserver.org
scan-47m.shadowserver.org
scan-47a.shadowserver.org
scan-47c.shadowserver.org
scan-42a.shadowserver.org
scan-42n.shadowserver.org
scan-42o.shadowserver.org

but what I would like it to do is:

  1. Match one or more digits (scan-2 or scan-02), optionally followed by a single letter (scan-2j or scan-02f).
  2. Append the rest of the User Agent: .shadowserver.org to the regex.

I will then add it to an existing URL Rewrite rule (as a condition) to abort the request.

Any advice/help would be very much appreciated

Tried:

To write a regex for IIS10 to block requests from a certain user-agent

Expected:

For it to work on single-digit numbers as well as double- and triple-digit numbers, with or without a letter.

(scan\-\d+)(?:\w)+\.shadowserver\.org

Input Text:

scan-2.shadowserver.org
scan-02.shadowserver.org
scan-2j.shadowserver.org
scan-02j.shadowserver.org
scan-17w.shadowserver.org
scan-101p.shadowserver.org

UPDATE:

I eventually came up with this:

scan\-[0-9]+[a-z]{0,1}\.shadowserver\.org

Upvotes: 0

Views: 97

Answers (1)

SaSkY

Reputation: 1086

This is an explanation of your regex pattern. If you only want the solution, go directly to the end.

(scan\-\d+)(?:\w)+

(scan\-\d+) Group 1: matches the word scan followed by a literal -. You escaped the hyphen with a \, but outside a character class an unescaped - is also a literal hyphen, so the escape is not needed here. The - is followed by \d+, which means one or more digits from 0-9; there must be at least one digit. Whatever this group matches is saved in the first capturing group.

(?:\w)+ Non-capturing group: \w matches one word character, which is equal to [A-Za-z0-9_]. The plus sign + after the group means "match the whole group one or more times", and since the group contains only \w, it matches one or more word characters. Note that the non-capturing group is redundant here; you can use \w+ directly.

Taking two examples:

The first example: scan-02.shadowserver.org

(scan\-\d+)(?:\w)+

  • scan matches the word scan in scan-02, and \- matches the hyphen after it, giving scan-. \d+ means one or more digits, so at first it matches the 02 and the value so far is scan-02. Next comes (?:\w)+, which must match at least one word character. It tries to match the period . and fails, because the period is not a word character. Is it over at this point? No. The regex engine backtracks into the previous \d+, which this time matches only the 0 in scan-02, so scan-0 is saved in the first capturing group, and (?:\w)+ then matches the 2. Why does the engine go back to \d+? Because both \d+ and (?:\w)+ use +, each needs at least one character (a digit and a word character respectively), and the engine gives up characters from the greedy \d+ until the rest of the pattern can match.
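The backtracking described above can be observed directly. The following is a minimal sketch that uses Python's re module purely as a test bench (an assumption; IIS URL Rewrite has its own regex engine, though these basic constructs should behave the same way). The names ORIGINAL, full_match and group_one are made up for illustration.

```python
import re

# The asker's original pattern, copied verbatim from the question.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

m = ORIGINAL.search("scan-02.shadowserver.org")

full_match = m.group(0)  # the whole matched text
group_one = m.group(1)   # what the first capturing group ended up holding

print(full_match)  # scan-02.shadowserver.org
print(group_one)   # scan-0  (the "2" was given up to (?:\w)+)
```

The overall match still covers the full host name; only the content of the capturing group is shorter than one might expect.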

The second example: scan-2.shadowserver.org

(scan\-\d+)(?:\w)+

  • (scan\-\d+) matches scan-2. (?:\w)+ then tries to match the period after scan-2 and fails, and this is the important point: \d+ holds only one digit and cannot give it up, because it needs at least one, so backtracking does not help. The engine then restarts the attempt one character further into the string scan-2.shadowserver.org, at the character c, where the s in (scan\-\d+) fails to match c. It keeps trying at each later position, and in the end the match fails.
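A quick way to confirm which inputs the original pattern rejects is sketched below, again using Python's re module only as a convenient test bench (an assumption, not the IIS engine). The helper matches_original is a hypothetical name.

```python
import re

# The asker's original pattern, copied verbatim from the question.
ORIGINAL = re.compile(r"(scan\-\d+)(?:\w)+\.shadowserver\.org")

def matches_original(user_agent: str) -> bool:
    """Return True if the original pattern matches anywhere in user_agent."""
    return ORIGINAL.search(user_agent) is not None

# One digit and no letter: \d+ has nothing to give up, so there is no match.
print(matches_original("scan-2.shadowserver.org"))   # False

# One digit plus a letter: (?:\w)+ can take the letter, so it matches.
print(matches_original("scan-2j.shadowserver.org"))  # True

# Two digits and no letter: matches only thanks to backtracking.
print(matches_original("scan-02.shadowserver.org"))  # True
```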

Simple solution:

(scan-\d+[a-z]?)\.shadowserver\.org

Explanation

(scan-\d+[a-z]?) Group 1: captures the word scan, followed by a literal -, followed by \d+ (one or more digits), followed by an optional lowercase letter [a-z]?. The ? makes the [a-z] part optional; without it, [a-z] would require exactly one lowercase letter.
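As a sanity check, the simple solution can be run against the input text from the question. This is a minimal sketch in Python (used here only for testing, which is an assumption; the rule itself would live in IIS URL Rewrite). The names SOLUTION, inputs, captured and rejected are illustrative, and the two rejected strings are made-up negative cases, not taken from any log.

```python
import re

# The simple solution from this answer.
SOLUTION = re.compile(r"(scan-\d+[a-z]?)\.shadowserver\.org")

# Input text from the question.
inputs = [
    "scan-2.shadowserver.org",
    "scan-02.shadowserver.org",
    "scan-2j.shadowserver.org",
    "scan-02j.shadowserver.org",
    "scan-17w.shadowserver.org",
    "scan-101p.shadowserver.org",
]

# Collect what group 1 captures for each input (None if there is no match).
captured = []
for ua in inputs:
    m = SOLUTION.search(ua)
    captured.append(m.group(1) if m else None)

print(captured)
# ['scan-2', 'scan-02', 'scan-2j', 'scan-02j', 'scan-17w', 'scan-101p']

# Hypothetical negative cases: no digit at all, and two letters.
rejected = [
    SOLUTION.search("scan-.shadowserver.org") is None,
    SOLUTION.search("scan-2jj.shadowserver.org") is None,
]
print(rejected)  # [True, True]
```

If you do not need the captured value in the rewrite rule, the parentheses can be dropped, which gives essentially the pattern from the question's update.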


Upvotes: 1
