Reputation: 2145
How do I create a regular expression that detects hexadecimal numbers in a text?
For example, ‘0x0f4’, ‘0acdadecf822eeff32aca5830e438cb54aa722e3’, and ‘8BADF00D’.
Upvotes: 177
Views: 325273
Reputation: 1356
I took the idea from this answer to ignore words by introducing more conditions and took it to the extreme until I had created this 1000 character monster:
(?<![\dA-ZÄÖÜẞa-zäöüß])\#?(?#phone numbers)(?:\+[\d ]*)?(?:\(\+?[\d ]+\) ?)?(?#AND blocks)(?=(?:(?#first)[a-f]*\d[\da-f]*(?:(?#mid)(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)*(?#last)(?: |\.|\,|\_|\/| ?[\/\-] ?)[a-f]*\d[\da-f]*(?#no date+time)(?!\:))?|(?#again for capitals)[A-F]*\d[\dA-F]*(?:(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\dA-F]+)*(?: |\.|\,|\_|\/| ?[\/\-] ?)[A-F]*\d[\dA-F]*(?!\:))?)(?#same length)(?![\dA-Fa-f])(?#anchor to end)([^\dA-ZÄÖÜẞa-zäöüß]?.*)$)(?#NAND date)(?!\d{1,4}([\.\_\/\-])\d{1,2}\2\d{1,4}(?#IP)(?!\2?[\da-fA-F]))(?#NAND part date)(?!\d\d([\.\_\/\-])\d{4}(?![\.\_\/\-]?[\da-fA-F]))(?#NAND year+time)(?!\d\d(?:\d\d)? \d{1,2}\:)(?#NAND house+city)(?!\d{1,3}[a-f]? \d{5} [A-Z])(?#AND length>5)[\da-fA-F](?:(?: |\.|\,|\_|\/| ?[\/\-] ?)?[\da-fA-F]){5}(?:(?#1 block)[\da-fA-F]*|(?#mid)[\da-fA-F \.\,\_\/\-]*(?: |\.|\,|\_|\/| ?[\/\-] ?)(?#last)[a-fA-F]*\d[\da-fA-F]*)(?#anchor to same end)(?=\1$)|(?#0x allows more)0x[\da-f]+(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)*(?=[^\dA-ZÄÖÜẞa-zäöüß])
My goal was actually slightly different: I wanted to exclude a bunch of annoying unreadable strings from notification texts and TTS. With the explanation below it should hopefully be reasonably easy to adjust it. This regex matches hex numbers, phone numbers, IP addresses and more. It allows grouping like 123 456 789, it specifically excludes regular words, addresses and dates, and it contains what is basically an AND operator, which doesn't really exist in regex. I don't know if anyone invented this before; I couldn't find anything online.
Some example matches: 0x1, #123ABC, ab1 abc 123, +49 (0)12 / 34 - 56, 127.0.0.1
Some example non-matches: 1Abcde (mixed case), 12345 (need 6+ chars, except with "0x"), x123456x, 2023-01-01 00:00, Street 12a 34567 City, decade
Some potentially unintended matches: 100 1/10 b1 (might technically be a valid house number), Eva-Zilcher-Gasse 1a 1100 Vienna (I focused the address exclusion on Germany), cafe420, 2023-01-01 123456
Explanation of the AND/NAND operator
If you want to match either one of two conditions, that's easy: ([ab]|[bc]) matches a, b and c.
But what if you want to match both conditions? Something like ([ab]&[bc]) that matches only b doesn't exist.
Searching for it online mostly turns up people who actually mean "a, then b or b, then a" (so both on the same line), which is not AND.
But it is actually possible: (?=[ab])[bc] matches only b!
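As a quick check of the claim above, here is a minimal sketch using Python's standard re module (the choice of Python and the variable names are assumptions for illustration, not part of the original answer). It compares the plain OR with the lookahead-based AND on the string abc.

```python
import re

# Plain OR: either character class may match.
either = re.compile(r"[ab]|[bc]")

# Emulated AND: the lookahead tests [ab] without consuming anything,
# then [bc] has to match at the very same position.
both = re.compile(r"(?=[ab])[bc]")

either_hits = either.findall("abc")
both_hits = both.findall("abc")

print(either_hits)  # every character satisfies at least one class
print(both_hits)    # only the character that satisfies both classes
```

Only b is in both classes, so it is the only character the second pattern reports.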
This works with a "positive lookahead". That is a zero-width ("non-capturing") group that just checks whether something exists after the current position, without extending the selection to include it. Left of it in this example is just nothing. Then it checks whether the next character is a or b, but the cursor stays where it is. Then it checks for b or c at the same position. Before this project, I only used lookaheads at the end of a regex, but they work everywhere.
It gets much more complicated if a condition can have a variable length. For example, a(?=.{2})[bc]+ will match all of abbbc, even though the first condition only wants 2 characters. That's because both things exist after the a: 2 characters, and a bunch of bs and cs. They're just not the same string. To prevent this, you have to check that everything after each condition is the same string, which anchors the two ends to the same point.
Example: a(?=.{2}(.*)$)[bc]+(?=\1) will only match the abb part of abbbc. Here, (.*) captures the rest of the line, and the $ ensures that it's actually all of it. (?=\1) then looks ahead to see if the rest of the line after the other condition is the same (or rather, the engine tries to find a spot where that is the case), without including it in the match. A separate (?=) is not needed for the first occurrence, because it's already inside a lookahead, which doesn't consume any text.
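The difference between the unanchored and the anchored variant can be reproduced with a short sketch in Python's re module (an assumption for illustration; other engines may behave differently, as the next paragraph of the answer suggests).

```python
import re

text = "abbbc"

# Two independent conditions after "a": nothing ties their ends together.
unanchored = re.compile(r"a(?=.{2})[bc]+")

# The first condition captures the rest of the line in group 1; the second
# condition must end exactly where that remainder begins.
anchored = re.compile(r"a(?=.{2}(.*)$)[bc]+(?=\1)")

unanchored_match = unanchored.search(text).group(0)
anchored_result = anchored.search(text)
anchored_match = anchored_result.group(0)
remainder = anchored_result.group(1)

print(unanchored_match)  # the whole string
print(anchored_match)    # only "a" plus two characters
print(remainder)         # the rest of the line captured by (.*)
```

The engine first tries the longest run of b and c, then backtracks until the text after the match equals the captured remainder.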
In some cases something even more complicated is necessary, because it seems like the regex engine doesn't always like to recalculate capture groups (()) for the backreference (\1) to match. In that case something similar to the end of the last condition might have to be repeated in the first condition (like a(?=.{1}[bc](.*)$)[bc]+(?=\1)). I don't fully understand that yet, and I'm unsure whether a generalised AND is even possible in all cases because of this, but I managed to make it work in this project, at least. During development, I even found some strange cases where (x|y) matched something, but (y|x) didn't (x and y stand for more complicated expressions here).
Explanation of components
(These explanations assume that you already know the most common regex elements, explaining everything from the start would take way too long.)
(?<![\dA-ZÄÖÜẞa-zäöüß]) checks for something to not include before this, so that it doesn't select something at the end of a word.
\#?: # occurs quite often before case numbers or so.
(?#phone) is a comment. I didn't know before this project that comments were possible, but it's really nice to spend less time looking for the right spot to modify in such a giant expression.
(?:) is a non-capturing group. It acts identically to (), except that it can't be referenced with \1 etc., which is nice, because I would have gone past \9 otherwise and the syntax for higher numbers or named references seems to depend on the platform.
(?:\+[\d ]*)?(?:\(\+?[\d ]+\) ?) matches phone numbers, including spaces and potentially one set of brackets and/or a plus before or inside them. Technically it also matches 1(+2), but there's a limit to how much complexity I wanted to implement for increasingly unlikely cases.
(?#AND) or (?#NAND) is a condition for the main part; they all act on the same bit of text.
[a-f]*\d[\da-f]* is the first block (marked (?#first) in the regex). It must include at least one digit, and it must always exist.
(?: |\.|\,|\_|\/| ?[\/\-] ?) is a list of all the possible separators between digits; this occurs a bunch of times in the regex. ␣ is useful for lots of things, "." for example for IP addresses, decimals or large numbers in German, "," for large numbers or decimals in German, "_" for file names with version numbers, "/" for case numbers, and ␣/␣ and ␣-␣ for phone numbers.
(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)* means "a separator and then more hex digits, arbitrarily many times". It's inside an optional group (()?), so a single block with no separators also works.
(?: |\.|\,|\_|\/| ?[\/\-] ?)[a-f]*\d[\da-f]* is a mix of the previous two: any separator and then a block containing at least one digit. The middle blocks and the last block are together in the optional group, so the last block needs to exist whenever there is a separator. That means that 1 a 2 can be matched, but not 1 a, because a could theoretically be a word.
(?!\:) checks that there is no : after a multi-block string, to prevent matching e.g. 01-01 00 in the string 2023-01-01 00:00.
(?![\dA-Fa-f]) makes sure that the maximum length is matched and not some other substring that would work with this, but not another condition, before ([^\dA-ZÄÖÜẞa-zäöüß]?.*)$ selects the entire rest of the line and stores it in \1 for later verification that another condition actually matched the same text.
(?!\d{1,4}([\.\_\/\-])\d{1,2}\2\d{1,4}) is easier compared to the rest: it just checks for three groups of digits with constrained lengths and two of the same separator. This excludes dates in various formats, like 2001-01-01, 1.1.2001 or 01/01/01.
(?!\2?[\da-fA-F]) sits inside that date check (marked (?#IP) in the regex). It looks whether there's another one of that separator and more (hex) digits after it, and then includes it in the match again due to the double negative. This is meant for IP addresses or other longer arrangements of number groups.
(?!\d\d([\.\_\/\-])\d{4}(?![\.\_\/\-]?[\da-fA-F])) is very similar to the previous condition, but matches just 2 and then 4 digits, not followed by another group, to exclude e.g. a partial date with a word behind it. I'm pretty sure there's no way to integrate that into the previous condition: I can't just make the first group of 1-4 digits optional, because then the capturing group for the first separator is not initialised and the backreference doesn't work. Forward references also don't exist.
(?!\d\d(?:\d\d)? \d{1,2}\:) excludes just the day or year and then the hour of a time written directly after it. It's very similar to an earlier date+time-excluding part, but catches slightly different cases.
(?!\d{1,3}[a-f]? \d{5} [A-Z]) is the house number and city exclusion. I actually read Wikipedia's article on (German) house numbering for this, but quickly decided to not cover all the madness that's possible in rare cases, because it would exclude way too many intended matches. This part of the regex is also quite German-centric; it doesn't match e.g. Austrian 4-digit postal codes or anything related to the USA's numbered roads.
AND-linked condition: [\da-fA-F](?:(?: |\.|\,|\_|\/| ?[\/\-] ?)?[\da-fA-F]){5} looks simply for exactly 6 hex digits with optional separators, which can then be followed by either [\da-fA-F]* (marked (?#1 block)) or [\da-fA-F \.\,\_\/\-]*(?: |\.|\,|\_|\/| ?[\/\-] ?)[a-fA-F]*\d[\da-fA-F]* (marked (?#mid) and (?#last)); see the AND explanation.
(?=\1$) anchors this condition to the same end as the first condition, as explained above. It randomly happened that none of the other conditions needed this; they could just look for things at the start of the matched string.
0x[\da-f]+(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)* bypasses all of those rules, because it allows many more cases if the hex number is prefixed with 0x, which I consider to be a good enough indicator that it's actually intended to be a hex number, even if most of it is letters.
(?=[^\dA-ZÄÖÜẞa-zäöüß]) makes sure that it matches the maximum possible length, but I'm not sure if this is even necessary, because * and + try to find as much as possible anyway. I also learned that | is enough for OR if you want to include everything on either side; you don't actually need (|).
I learned a lot in this project, which took me multiple days to finish. It was a lot of fun problem solving, interrupted by occasional frustration. Now I'll integrate this into my mail notification macro; hopefully that app supports all these fancy regex features…
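To make the structure easier to experiment with, here is a heavily simplified sketch in Python. It is not the full regex above: SIMPLIFIED and find_number_like are hypothetical names, only the date exclusion and the length condition are kept, and the IP exception, the phone prefix and the "at least one digit per block" rule are left out on purpose. It mainly shows that (?#...) comments and a negative lookahead (the NAND) can sit in front of the consuming part.

```python
import re

# Hypothetical, heavily simplified pattern; NOT the author's full regex.
SIMPLIFIED = re.compile(
    r"(?<![\dA-Za-z])"                                   # not at the end of a word
    r"(?#NAND date)(?!\d{1,4}([._/-])\d{1,2}\1\d{1,4})"  # reject date-like starts
    r"(?#AND length>5)[\da-fA-F](?:[ ._/-]?[\da-fA-F]){5,}"
    r"(?![\dA-Za-z])"                                    # not at the start of a word
)


def find_number_like(text):
    """Return the first number-like match in text, or None if there is none."""
    match = SIMPLIFIED.search(text)
    return match.group(0) if match else None


print(find_number_like("2023-01-01"))      # date-like, so rejected
print(find_number_like("123 456 789"))     # grouped digits are accepted
print(find_number_like("id deadbeef42"))   # long hex block is accepted
```

Because the simplification drops the digit requirement, this sketch would also accept pure-letter words such as decade, which the full regex is designed to reject.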
Upvotes: 0
Reputation: 11
If your regex engine has Extended Unicode Support, you can match a character that has the Hex_Digit property with \p{Hex_Digit}. Therefore, to match a hex number optionally prefixed with 0x, the regex would be (0x)?\p{Hex_Digit}+.
However, as @d512 points out in their comment on another answer, this is still a bit naïve, and will also match hex numbers concatenated with non-hex strings. To avoid this, surround the expression with word boundary anchors like so: \b(0x)?\p{Hex_Digit}+\b.
You can see this in action here. Unfortunately, it appears JavaScript doesn't properly support fullwidth characters together with word boundaries, but Rust's main regex crate, and Python with the regex module, do.
Upvotes: 1
Reputation: 385
First, instead of ^ and $ use \b, as this is a word delimiter and can help when the hash is not the only string in the line.
I came here looking for a similar but more specialized regex and came up with this:
\b(\d+[a-f]+\d+[\da-f]*|[a-f]+\d+[a-f]+[\da-f]*)\b
I needed to detect hashes like git commit identifiers (and similar) in console output, and more than matching all possible hashes I prioritize NOT matching random words or numbers like EB or 12345678.
So the heuristic I came up with is to assume that a hash alternates between numbers and letters reasonably often, and that the chains of only numbers or only letters are short.
Another important fact is that an MD5 hash is 32 characters long (as mentioned by @Adaddinsane) and git displays a shortened version with only 10 characters, so the above example can be modified as follows.
For 10-char long hashes I assume the groups will be at most 3-char long:
\b(\d+[a-f]+\d+[\da-f]{1,7}|[a-f]+\d+[a-f]+[\da-f]{1,7})\b
For up to 32-char long hashes I assume the groups will be at most 5-char long:
\b(\d+[a-f]+\d+[\da-f]{17,29}|[a-f]+\d+[a-f]+[\da-f]{17,29})\b
You can easily change a-f to a-fA-F for case insensitivity, or add 0[xX] at the front to match that 0x prefix.
These examples will obviously not match exotic but valid hashes that have very long sequences of only numbers or only letters at the front, or extreme hashes like only 0s.
But this way I can match hashes and significantly reduce accidental false-positive matches, like a directory name or line number.
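A small sketch of how the first (unbounded) heuristic behaves, written with Python's re module. The sample line and the names HASH_LIKE and find_hash_like are made up for illustration.

```python
import re

# The first heuristic from this answer: digits and lowercase hex letters
# must alternate at least twice near the start of the word.
HASH_LIKE = re.compile(r"\b(\d+[a-f]+\d+[\da-f]*|[a-f]+\d+[a-f]+[\da-f]*)\b")


def find_hash_like(text):
    """Return all substrings that look like a short hash under the heuristic."""
    return [match.group(0) for match in HASH_LIKE.finditer(text)]


# Hypothetical console line: one abbreviated commit id, a plain number,
# an uppercase word, and a hex word that contains no digits at all.
sample = "commit 3f2a9c1 fixed line 12345678 in EB, see deadbeef"
hits = find_hash_like(sample)
print(hits)
```

The plain number and the words are skipped, and so is deadbeef, which illustrates the trade-off mentioned above: valid hex strings without any digit-letter alternation are not detected.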
Upvotes: 0
Reputation: 2553
In Java this is allowed:
(?:0x?)?[\p{XDigit}]+$
As you see, the 0x is optional (even the x is optional) in a non-capturing group.
Upvotes: 1
Reputation: 71
Another example: Hexadecimal values for css colors start with a pound sign, or hash (#), then six characters that can either be a numeral or a letter between A and F, inclusive.
^#[0-9a-fA-F]{6}
Upvotes: 7
Reputation: 21400
In case you need this within an input where the user can also type just 0 or 0x on the way to a full number (note that, since every part of the prefix is optional, it also accepts a hex number without the 0x prefix):
^0?[xX]?[0-9a-fA-F]*$
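If unprefixed input really should be rejected while partial input such as 0 or 0x stays valid, the prefix has to be nested instead of independently optional. The following Python sketch compares both; STRICT is a hypothetical variant and not part of the original answer.

```python
import re

# Pattern from the answer: every part is optional on its own.
LENIENT = re.compile(r"^0?[xX]?[0-9a-fA-F]*$")

# Hypothetical stricter variant: hex digits are only allowed after a full
# 0x prefix, but the empty string, "0" and "0x" are still accepted while typing.
STRICT = re.compile(r"^(?:0(?:[xX][0-9a-fA-F]*)?)?$")


def accepts(pattern, text):
    """Return True if the whole input is valid under the given pattern."""
    return pattern.match(text) is not None


for candidate in ["", "0", "0x", "0x1F", "ff", "x1"]:
    print(repr(candidate), accepts(LENIENT, candidate), accepts(STRICT, candidate))
```

Both patterns accept the first four inputs; only the lenient one accepts ff and x1.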
Upvotes: 0
Reputation: 41
If you are looking for a specific hex character in the middle of the string, you can use "\xhh", where hh is the character code in hexadecimal. I've tried it and it works. I use the Qt framework for C++, but it can solve problems in other cases too, depending on the flavor you need to use (PHP, JavaScript, Python, Go, etc.).
This answer was taken from: http://ult-tex.net/info/perl/
Upvotes: 4
Reputation: 395
Just for the record I would specify the following:
/^[xX]?[0-9a-fA-F]{6}$/
This differs in that it requires exactly six valid hex characters and allows an optional lowercase or uppercase x in front of them.
Upvotes: 6
Reputation: 30781
If you're using Perl or PHP, you can replace
[0-9a-fA-F]
with:
[[:xdigit:]]
Upvotes: 8
Reputation: 569
This one makes sure you have exactly three valid pairs:
(([a-fA-F]|[0-9]){2}){3}
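Any more or fewer than three pairs of valid characters fail to match, provided the pattern is anchored (for example with ^ and $); without anchors it also matches six hex characters inside a longer run.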
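The effect of anchoring can be checked with a short Python sketch; re.fullmatch is used here as a stand-in for adding ^ and $, and the variable names are illustrative only.

```python
import re

THREE_PAIRS = re.compile(r"(([a-fA-F]|[0-9]){2}){3}")

# Unanchored search: a longer run of hex characters still yields a match
# on its first six characters.
unanchored = THREE_PAIRS.search("aabbccdd")

# Anchored to the whole string: four pairs are rejected, three are accepted.
whole_four = THREE_PAIRS.fullmatch("aabbccdd")
whole_three = THREE_PAIRS.fullmatch("A1b2C3")

print(unanchored.group(0))
print(whole_four)
print(whole_three.group(0))
```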
Upvotes: 1
Reputation: 535
It's worth mentioning that detecting an MD5 (which is one of the examples) can be done with:
[0-9a-fA-F]{32}
Upvotes: 18
Reputation: 27961
Not a big deal, but many regex engines support the POSIX character classes, and there's [:xdigit:] for matching hex characters, which is simpler than the common 0-9a-fA-F stuff.
So, the regex as requested (i.e. with optional 0x) is: /(0x)?[[:xdigit:]]+/
Upvotes: 34
Reputation: 4806
This will match with or without the 0x prefix:
(?:0[xX])?[0-9a-fA-F]+
Upvotes: 12
Reputation: 4916
The exact syntax depends on your exact requirements and programming language, but basically:
/[0-9a-fA-F]+/
or more simply, using the i flag to make it case-insensitive:
/[0-9a-f]+/i
If you are lucky enough to be using Ruby, you can do:
/\h+/
EDIT - Steven Schroeder's answer made me realise my understanding of the 0x bit was wrong, so I've updated my suggestions accordingly. If you also want to match 0x, the equivalents are
/0[xX][0-9a-fA-F]+/
/0x[0-9a-f]+/i
/0x[\h]+/i
ADDED MORE - If 0x needs to be optional (as the question implies):
/(0x)?[0-9a-f]+/i
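For readers who want to try the optional-prefix version on the examples from the question, here is a minimal Python sketch. Python and the names HEX and found are assumptions for illustration; the group is made non-capturing so that findall returns the whole match.

```python
import re

# Optional 0x prefix, case-insensitive, equivalent to /(0x)?[0-9a-f]+/i.
HEX = re.compile(r"(?:0x)?[0-9a-f]+", re.IGNORECASE)

text = "0x0f4 0acdadecf822eeff32aca5830e438cb54aa722e3 8BADF00D"
found = HEX.findall(text)
print(found)

# Caveat: ordinary words made only of the letters a-f match as well.
print(HEX.findall("decade"))
```

The second call shows why some of the other answers add word boundaries or extra heuristics.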
Upvotes: 65
Reputation: 6194
How about the following?
0[xX][0-9a-fA-F]+
Matches an expression starting with a 0, followed by either a lowercase or uppercase x, followed by one or more characters in the ranges 0-9, a-f, or A-F.
Upvotes: 297