Reputation: 141
I have this regex to find SHA-1 hashes in a Kusto column:
\b[a-fA-F0-9]{40}\b
However, I am getting lots of matches for purely decimal numbers (only the digits 0-9). How can I ensure that the match contains at least one hex letter (a-f)?
Kusto doesn't support lookarounds according to this: Does Kusto not support regex lookarounds?
Upvotes: 1
Views: 262
Reputation: 44981
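Use extract_all() and array_length() to compare the number of 40-character hex strings with the number of 40-digit decimal strings. A row with more hex matches than decimal matches contains at least one string with a hex letter.
Note that with this method we don't need to extract any actual text: the empty capture group () is only there because extract_all() requires a capture group, so every match yields an empty string that is merely counted.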
datatable(text:string)
[
"SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
,"Only digits: 6791012659213568246582140340987435098743"
,"Too short: f0cf934569319b10e85a9d"
,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| where array_length(extract_all(@"\b[[:xdigit:]]{40}\b()", text)) > coalesce(array_length(extract_all(@"\b\d{40}\b()", text)), 0)
text |
---|
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02 |
888ead874a7c562ef1642e83cca05f2f920a2399 |
By leveraging set_difference() we can get the SHA1 values themselves:
datatable(text:string)
[
"SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
,"Only digits: 6791012659213568246582140340987435098743"
,"Too short: f0cf934569319b10e85a9d"
,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| extend hex = extract_all(@"\b([[:xdigit:]]{40})\b", text), dec = extract_all(@"\b(\d{40})\b", text)
| extend sha1 = set_difference(hex, dec)
text | hex | dec | sha1 |
---|---|---|---|
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02 | ["273d3fd2f0cf934569319b10e85a9dfadcff113c","6791012659213568246582140340987435098743","e59c299bc9b181240c546464a93ac2d4d001ce02"] | ["6791012659213568246582140340987435098743"] | ["273d3fd2f0cf934569319b10e85a9dfadcff113c","e59c299bc9b181240c546464a93ac2d4d001ce02"] |
Only digits: 6791012659213568246582140340987435098743 | ["6791012659213568246582140340987435098743"] | ["6791012659213568246582140340987435098743"] | [] |
Too short: f0cf934569319b10e85a9d | | | |
Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123 | | | |
888ead874a7c562ef1642e83cca05f2f920a2399 | ["888ead874a7c562ef1642e83cca05f2f920a2399"] | | ["888ead874a7c562ef1642e83cca05f2f920a2399"] |
Upvotes: 1
Reputation: 44981
A solution based on extract_all() followed by matches regex on the results.
Extract all 40-character hex strings and check whether the result contains a character from the set [a-fA-F].
datatable(text:string)
[
"SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
,"Only digits: 6791012659213568246582140340987435098743"
,"Too short: f0cf934569319b10e85a9d"
,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| where extract_all(@"\b([[:xdigit:]]{40})\b", text) matches regex "[a-fA-F]"
text |
---|
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02 |
888ead874a7c562ef1642e83cca05f2f920a2399 |
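To sanity-check this logic outside Kusto, here is a minimal Python sketch of the same idea, applied to the sample rows from this answer. It is not KQL: the names `HEX40`, `HAS_LETTER`, `rows` and `has_sha1_candidate` are made up for illustration, and the letter check is done per extracted string rather than on the serialized array.

```python
import re

# 40 hex characters between word boundaries (same idea as the KQL pattern).
HEX40 = re.compile(r"\b[0-9a-fA-F]{40}\b")
# At least one hex letter, as opposed to a decimal digit.
HAS_LETTER = re.compile(r"[a-fA-F]")

rows = [
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02",
    "Only digits: 6791012659213568246582140340987435098743",
    "Too short: f0cf934569319b10e85a9d",
    "Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123",
    "888ead874a7c562ef1642e83cca05f2f920a2399",
]

def has_sha1_candidate(text):
    """True if any 40-character hex string in text contains a hex letter."""
    return any(HAS_LETTER.search(token) for token in HEX40.findall(text))

# Keep only the rows that contain at least one plausible SHA1.
kept = [row for row in rows if has_sha1_candidate(row)]
```

With the sample data, `kept` holds the first and the last row, the same two rows as in the result table above.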
Upvotes: 1
Reputation: 163577
Perhaps you can match 40 digits between word boundaries to get that out of the way, and use an alternation | with a capture group ([a-fA-F0-9]{40}) to capture what you would allow with extract_all:
\b[0-9]{40}\b|\b([a-fA-F0-9]{40})\b
See a regex demo with the capture group value.
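The effect of the alternation can be illustrated with a short Python sketch (Python's `re` module rather than Kusto, so treat it as a demonstration of the pattern only; `PATTERN` and `sha1_candidates` are hypothetical names). The all-digit branch consumes 40-digit numbers without filling the capture group, so only matches where group 1 took part are kept.

```python
import re

# First branch swallows 40-digit numbers; second branch captures the rest.
PATTERN = re.compile(r"\b[0-9]{40}\b|\b([a-fA-F0-9]{40})\b")

def sha1_candidates(text):
    """Return the group 1 values, skipping matches of the all-digit branch."""
    return [
        match.group(1)
        for match in PATTERN.finditer(text)
        if match.group(1) is not None  # group 1 is None for all-digit matches
    ]

sample = (
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c "
    "6791012659213568246582140340987435098743 "
    "e59c299bc9b181240c546464a93ac2d4d001ce02"
)
found = sha1_candidates(sample)
```

Here `found` contains the two hashes with hex letters and leaves out the 40-digit number. How Kusto's extract_all reports the matches of the first branch (where the group does not participate) is something I have not verified, so those entries may still need to be filtered out there.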
Upvotes: 2
Reputation: 141
I made my query more efficient and was able to resolve this later in the Kusto query instead of changing the regex. I will not mark this as the accepted answer, because the original question is about how to accomplish this within the regex itself, and it would be interesting to have that answer.
This is what I did:
...
| where Content matches regex @'\b[a-fA-F0-9]{40}\b'
| extend match = extract_all(@'(\b[a-fA-F0-9]{40}\b)', Content)
| mv-expand match
| where not (match matches regex @'\b[0-9]{40}\b')
...
In the last line I remove the matches that consist only of decimal digits.
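For the regex-only variant that the question asks about, one lookaround-free option is to enumerate where the first hex letter occurs: i digits, then one letter, then 39 - i arbitrary hex characters, for every i from 0 to 39. The Python sketch below generates such a pattern. `sha1_pattern` is a hypothetical helper, and I have only checked the result with Python's `re`; it uses nothing beyond character classes, counted repetition, alternation and a non-capturing group, but whether Kusto accepts a pattern of this size is an assumption.

```python
import re

def sha1_pattern(length=40):
    """Build a lookaround-free regex for `length` hex chars with >= 1 letter."""
    alternatives = []
    for i in range(length):
        # i leading digits, the first hex letter, then any hex characters.
        alternatives.append(
            f"[0-9]{{{i}}}[a-fA-F][0-9a-fA-F]{{{length - 1 - i}}}"
        )
    return r"\b(?:" + "|".join(alternatives) + r")\b"

pattern = sha1_pattern()

sample = (
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c "
    "6791012659213568246582140340987435098743 "
    "e59c299bc9b181240c546464a93ac2d4d001ce02"
)
# No capture groups in the pattern, so findall returns the full matches.
matches = re.findall(pattern, sample)
```

Every alternative is exactly 40 characters long, so the word boundaries still reject strings that are too short or too long, and an all-digit string fails every alternative because none of them can find its mandatory letter.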
Upvotes: 0