Reputation: 141

Regex to match SHA1 but must contain HEX characters

I have this regex to find SHA1 hashes in a Kusto column:

\b[a-fA-F0-9]{40}\b

However, I am getting lots of matches for strings that consist only of decimal digits (0-9). How can I ensure that the match contains at least one hex letter (a-f)?

Kusto doesn't support lookarounds, according to this question: "Does Kusto not support regex lookarounds?"

Upvotes: 1

Answers (4)

David דודו Markovitz

Reputation: 44981

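Use extract_all() and array_length() to compare the number of 40-character hex strings with the number of 40-digit decimal strings. A row qualifies when it contains more of the former than of the latter.

Note that this method only counts matches, so the empty capture group () extracts nothing but empty strings.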

datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
   ,"Too short: f0cf934569319b10e85a9d"
   ,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
   ,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| where array_length(extract_all(@"\b[[:xdigit:]]{40}\b()", text)) > coalesce(array_length(extract_all(@"\b\d{40}\b()", text)), 0)

text
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02
888ead874a7c562ef1642e83cca05f2f920a2399

By leveraging set_difference() we can get the SHA1 values themselves:

datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
   ,"Too short: f0cf934569319b10e85a9d"
   ,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
   ,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| extend hex = extract_all(@"\b([[:xdigit:]]{40})\b", text), dec = extract_all(@"\b(\d{40})\b", text)
| extend sha1 = set_difference(hex, dec)

text	hex	dec	sha1
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02	["273d3fd2f0cf934569319b10e85a9dfadcff113c","6791012659213568246582140340987435098743","e59c299bc9b181240c546464a93ac2d4d001ce02"]	["6791012659213568246582140340987435098743"]	["273d3fd2f0cf934569319b10e85a9dfadcff113c","e59c299bc9b181240c546464a93ac2d4d001ce02"]
Only digits: 6791012659213568246582140340987435098743	["6791012659213568246582140340987435098743"]	["6791012659213568246582140340987435098743"]	[]
Too short: f0cf934569319b10e85a9d
Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123
888ead874a7c562ef1642e83cca05f2f920a2399	["888ead874a7c562ef1642e83cca05f2f920a2399"]		["888ead874a7c562ef1642e83cca05f2f920a2399"]

Upvotes: 1

David דודו Markovitz

Reputation: 44981

A solution based on extract_all() followed by matches regex on the results.

Extract all 40-character hex strings and check whether the result contains a character from the set [a-fA-F].

datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02"
   ,"Only digits: 6791012659213568246582140340987435098743"
   ,"Too short: f0cf934569319b10e85a9d"
   ,"Too long: 273d3fd2f0cf934569319b10e85a9dfadcff113c123"
   ,"888ead874a7c562ef1642e83cca05f2f920a2399"
]
| where extract_all(@"\b([[:xdigit:]]{40})\b", text) matches regex "[a-fA-F]"

text
SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743 e59c299bc9b181240c546464a93ac2d4d001ce02
888ead874a7c562ef1642e83cca05f2f920a2399

Upvotes: 1

The fourth bird

Reputation: 163577

Perhaps you can first match 40 digits between word boundaries to get those out of the way, and then use an alternation | with a capture group ([a-fA-F0-9]{40}) to capture what you do want to allow. The capture group is what extract_all would return:

\b[0-9]{40}\b|\b([a-fA-F0-9]{40})\b

See a regex demo with the capture group value.
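
In Kusto this might look roughly like the sketch below. The sample rows and the column names text and captured are assumptions for illustration only. The idea is that a digits-only string is consumed by the first alternative, which leaves the capture group empty, so dropping the empty entries keeps only the strings that contain at least one hex letter.

```kusto
// Sketch only: the sample rows and the column names are assumptions.
datatable(text:string)
[
    "SHA1: 273d3fd2f0cf934569319b10e85a9dfadcff113c 6791012659213568246582140340987435098743"
   ,"Only digits: 6791012659213568246582140340987435098743"
]
// The pattern has one capture group, so extract_all() returns that group for each match.
| extend captured = extract_all(@"\b[0-9]{40}\b|\b([a-fA-F0-9]{40})\b", text)
// One row per extracted value, converted to string.
| mv-expand captured to typeof(string)
// Digits-only strings match the first alternative and capture nothing, so drop empty values.
| where isnotempty(captured)
```

Filtering with isnotempty() avoids depending on whether a non-participating group comes back as an empty string or as null.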

Upvotes: 2

Milton Carranza

Reputation: 141

I made my query more efficient and was able to resolve this later in the Kusto query instead of changing the regex. I will not mark this as the accepted answer, because the original question is about how to accomplish this within the regex itself, and it would be interesting to have that answer.

This is what I did:

...
| where Content matches regex @'\b[a-fA-F0-9]{40}\b'
| extend match = extract_all(@'(\b[a-fA-F0-9]{40}\b)', Content) 
| mv-expand match
| where not (match matches regex @'\b[0-9]{40}\b')
...

In the last line I remove the matches that consist only of decimal digits.

Upvotes: 0
