Reputation: 2145

Regular expression for a hexadecimal number?

How do I create a regular expression that detects hexadecimal numbers in a text?

For example, ‘0x0f4’, ‘0acdadecf822eeff32aca5830e438cb54aa722e3’, and ‘8BADF00D’.

Upvotes: 177

Answers (15)

Fabian Röling

Reputation: 1356

I took the idea from this answer to ignore words by introducing more conditions and took it to the extreme until I had created this 1000 character monster:

(?<![\dA-ZÄÖÜẞa-zäöüß])\#?(?#phone numbers)(?:\+[\d ]*)?(?:\(\+?[\d ]+\) ?)?(?#AND blocks)(?=(?:(?#first)[a-f]*\d[\da-f]*(?:(?#mid)(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)*(?#last)(?: |\.|\,|\_|\/| ?[\/\-] ?)[a-f]*\d[\da-f]*(?#no date+time)(?!\:))?|(?#again for capitals)[A-F]*\d[\dA-F]*(?:(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\dA-F]+)*(?: |\.|\,|\_|\/| ?[\/\-] ?)[A-F]*\d[\dA-F]*(?!\:))?)(?#same length)(?![\dA-Fa-f])(?#anchor to end)([^\dA-ZÄÖÜẞa-zäöüß]?.*)$)(?#NAND date)(?!\d{1,4}([\.\_\/\-])\d{1,2}\2\d{1,4}(?#IP)(?!\2?[\da-fA-F]))(?#NAND part date)(?!\d\d([\.\_\/\-])\d{4}(?![\.\_\/\-]?[\da-fA-F]))(?#NAND year+time)(?!\d\d(?:\d\d)? \d{1,2}\:)(?#NAND house+city)(?!\d{1,3}[a-f]? \d{5} [A-Z])(?#AND length>5)[\da-fA-F](?:(?: |\.|\,|\_|\/| ?[\/\-] ?)?[\da-fA-F]){5}(?:(?#1 block)[\da-fA-F]*|(?#mid)[\da-fA-F \.\,\_\/\-]*(?: |\.|\,|\_|\/| ?[\/\-] ?)(?#last)[a-fA-F]*\d[\da-fA-F]*)(?#anchor to same end)(?=\1$)|(?#0x allows more)0x[\da-f]+(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)*(?=[^\dA-ZÄÖÜẞa-zäöüß])

My goal was actually slightly different, I wanted to exclude a bunch of annoying unreadable strings from notification texts and TTS. With the explanation below it should hopefully be reasonably easy to adjust it. This regex matches hex numbers, phone numbers, IP addresses and more, it allows grouping stuff like 123 456 789, it specifically excludes stuff like regular words, addresses or dates and it contains basically an AND operator, which doesn't really exist in regex. I don't know if anyone invented this before, I couldn't find anything online.

Some example matches: 0x1, #123ABC, ab1 abc 123, +49 (0)12 / 34 - 56, 127.0.0.1
Some example non-matches: 1Abcde (mixed case), 12345 (need 6+ chars, except with "0x"), x123456x, 2023-01-01 00:00, Street 12a 34567 City, decade
Some potentially unintended matches: 100 1/10 b1 (might technically be a valid house number), Eva-Zilcher-Gasse 1a 1100 Vienna (I focused the address exclusion on Germany), cafe420, 2023-01-01 123456

Explanation of the AND/NAND operator

If you want to match either one of two conditions, that's easy: ([ab]|[bc]) matches a, b and c.
But what if you want to match both conditions? Something like ([ab]&[bc]) that matches only b doesn't exist.
Searching for it online results in lots of people actually meaning "a, then b or b, then a" (so both on the same line), which is not AND.
But it is actually possible: (?=[ab])[bc] matches only b!
This works with a "positive lookahead". That is a "non-capturing" group that just checks whether something follows the current position, without extending the selection to include it. Left of it in this example is just nothing. Then it checks whether a or b comes next, but the cursor stays where it is. Then it checks for b or c at the same position. Before this project, I only used lookaheads at the end of a regex, but they work everywhere.
It gets much more complicated if a condition can have a variable length. For example a(?=.{2})[bc]+ will match abbbc, even though the first condition only wants 2 characters. That's because both things exist after the a: 2 characters and a bunch of bs and cs. They're just not the same string. To prevent this, you actually have to check whether everything after it is the same string, which anchors the two ends to the same point.
Example: a(?=.{2}(.*)$)[bc]+(?=\1) will only match the abb part of abbbc. Here, .* captures the rest of the line, the $ ensures that it's actually all of it. (?=\1) then looks ahead to see if the rest of the line after the other condition is the same (or rather, it tries to find a spot where that is the case), without including it in the match. (?=) is not needed for the first occurrence, because it's already inside the lookahead.
In some cases something even more complicated is necessary, because it seems like the regex engine doesn't always like to recalculate capture groups (()) for the backreference (\1) to match, in that case something similar to the end of the last condition might have to be repeated in the first condition (like a(?=.{1}[bc](.*)$)[bc]+(?=\1)). I don't fully understand that yet and I'm unsure whether generalised AND is even possible in all cases because of this, but I managed to make it work in this project, at least. During development, I even found some strange cases where (x|y) matched something, but (y|x) didn't (x and y stand for more complicated expressions here).
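The three patterns from this explanation can be checked with a short Python sketch. It assumes Python's built-in re module, which keeps a group captured inside a lookahead available to a later backreference; other engines may behave differently. The helper name matched is made up for this illustration.

```python
import re

def matched(pattern, text):
    """Return the matched substring, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# AND of two single-character conditions: only "b" is in both sets.
and_single = [c for c in "abc" if matched(r"(?=[ab])[bc]", c)]

# Variable length without an end anchor: the lookahead and [bc]+ may
# cover different lengths, so the whole string is matched.
unanchored = matched(r"a(?=.{2})[bc]+", "abbbc")

# Both conditions are forced to end at the same point via group 1.
anchored = matched(r"a(?=.{2}(.*)$)[bc]+(?=\1)", "abbbc")

print(and_single, unanchored, anchored)
```

In the last pattern, [bc]+ first grabs everything and then backtracks until the text after it equals what the lookahead stored in group 1.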

Explanation of components

(These explanations assume that you already know the most common regex elements, explaining everything from the start would take way too long.)

(?<![\dA-ZÄÖÜẞa-zäöüß]) checks for something to not include before this, so that it doesn't select something at the end of a word.
\#?: # occurs quite often before case numbers or so.
(?#phone) is a comment. I didn't know before this project that comments were possible, but it's really nice to spend less time looking for the right spot to modify in such a giant expression.
(?:) is a non-capturing group. It acts identically to (), except that it can't be referenced with \1 etc., which is nice, because I would have gone past \9 otherwise and the syntax for higher numbers or named references seems to depend on the platform.
(?:\+[\d ]*)?(?:\(\+?[\d ]+\) ?) matches phone numbers, including spaces and potentially one set of brackets and/or a plus before or inside them. Technically it also matches 1(+2), but there's a limit to how much complexity I wanted to implement for increasingly unlikely cases.
Everything marked with (?#AND) or (?#NAND) is a condition for the main part, they all act on the same bit of text.
The first block is [a-f]*\d[\da-f]*, so it must include at least one digit, this must always exist.
(?: |\.|\,|\_|\/| ?[\/\-] ?) is a list of all the possible separators between digits, this occurs a bunch of times in the regex. ␣ is useful for lots of things, . for example for IP addresses, decimals or large numbers in German, , for large numbers or decimals in German, _ for file names with version numbers, / for case numbers, ␣/␣ and ␣-␣ for phone numbers.
(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)* means "a separator and then more hex digits, arbitrarily many times". It's inside an optional group (()?), so a single block with no separators also works.
(?: |\.|\,|\_|\/| ?[\/\-] ?)[a-f]*\d[\da-f]* is a mix of the previous two, it's any separator and then a block containing at least one digit. The middle blocks and the last block are together in the optional group, so the last block needs to exist. That means that 1 a 2 can be matched, but not 1 a, because a could theoretically be a word.
(?!\:) checks that there is no : after a multi-block string, to prevent matching e.g. 01-01 00 in the string 2023-01-01 00:00.
The previous five bullet points are then repeated again for capital letters, making sure that lowercase or uppercase is matched, but not a mix of both.
(?![\dA-Fa-f]) makes sure that the maximum length is matched and not some other substring that would work with this, but not another condition, before ([^\dA-ZÄÖÜẞa-zäöüß]?.*)$ selects the entire rest of the line and stores it in \1 for later verification that another condition actually matched the same text.
(?!\d{1,4}([\.\_\/\-])\d{1,2}\2\d{1,4}) is easier compared to the rest, it just checks for three groups of digits with constrained lengths and two of the same separator. This excludes dates in various formats, like 2001-01-01, 1.1.2001 or 01/01/01.
A little addition to that is (?!\2?[\da-fA-F]), which looks if there's another one of that separator and more (hex) digits after it and then includes it in the match again due to the double negative. This is meant for IP addresses or other longer arrangements of number groups.
It's a bit annoying that I had to include (?!\d\d([\.\_\/\-])\d{4}(?![\.\_\/\-]?[\da-fA-F])), which is very similar to the previous condition, but matches just 2 and then 4 digits, not followed by another group, to exclude e.g. a partial date with a word behind. I'm pretty sure there's no way to integrate that into the previous condition, I can't just make the first group of 1-4 digits optional, because then the capturing group for the first separator is not initialised and the backreference doesn't work. Forward-references also don't exist.
Similarly, (?!\d\d(?:\d\d)? \d{1,2}\:) excludes just the day or year and then the hour of a time written directly after it. It's very similar to an earlier date+time-excluding part, but catches slightly different cases.
Normally I would expect people to write addresses like "Street 12a, 34567 City", but they very often don't include the comma, so I wrote a special case for this: (?!\d{1,3}[a-f]? \d{5} [A-Z]). I actually read Wikipedia's article on (German) house numbering for this, but quickly decided to not cover all the madness that's possible in rare cases, because it would exclude way too many intended matches. This part of the regex is also quite German-centric, it doesn't match e.g. Austrian 4-digit postal codes or anything related to USA's numbered roads.
Now the last AND-linked condition: [\da-fA-F](?:(?: |\.|\,|\_|\/| ?[\/\-] ?)?[\da-fA-F]){5} looks simply for exactly 6 hex digits with optional separators, which can then be followed by…
… more hex digits or nothing: [\da-fA-F]*
… or some more blocks, the last of which needs to include a digit: [\da-fA-F \.\,\_\/\-]*(?: |\.|\,|\_|\/| ?[\/\-] ?)[a-fA-F]*\d[\da-fA-F]*
That's quite a bit of repetition of the first condition, as I said in the AND explanation.
And finally for the main part, (?=\1$) anchors this condition to the same end as the first condition, as explained above. It randomly happened that none of the other conditions needed this, they could just look for things at the start of the matched string.
And actually finally, 0x[\da-f]+(?:(?: |\.|\,|\_|\/| ?[\/\-] ?)[\da-f]+)* bypasses all of those rules, because it allows many more cases if the hex number is prefixed with 0x, which I consider to be a good enough indicator that it's actually intended to be a hex number, even if most of it is letters. (?=[^\dA-ZÄÖÜẞa-zäöüß]) makes sure that it matches the maximum possible length, but I'm not sure if this is even necessary, because * and + try to find as much as possible anyway. I also learned that | is enough for OR if you want to include everything on either side, you don't actually need (|).
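To make the date exclusion and its IP exception from the list above more concrete, here is a minimal Python sketch of just that piece in isolation. Because it stands alone, the separator group is \1 rather than \2 as in the full expression, and it is used as a positive test instead of inside (?!...). The names DATE_LIKE and looks_like_date are made up for this example.

```python
import re

# Three digit groups joined by the same separator, unless another
# separator-plus-hex-digit group follows (as in an IP address).
DATE_LIKE = re.compile(
    r"\d{1,4}([._/\-])\d{1,2}\1\d{1,4}(?!\1?[\da-fA-F])"
)

def looks_like_date(text):
    """True if text starts with something the date condition would exclude."""
    return DATE_LIKE.match(text) is not None

for sample in ["2001-01-01", "1.1.2001", "01/01/01", "127.0.0.1"]:
    print(sample, looks_like_date(sample))
```

The three date formats are flagged, while 127.0.0.1 is not, because a fourth group follows the third one.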

I learned a lot in this project, which took me multiple days to finish. It was a lot of fun problem solving, interrupted by occasional frustration. Now I'll integrate this into my mail notification macro, hopefully that app supports all these fancy regex features…

Upvotes: 0

IndigoLily

Reputation: 11

If your regex engine has Extended Unicode Support, you can match a character that has the Hex_Digit property with \p{Hex_Digit}. Therefore, to match a hex number optionally prefixed with 0x, the regex would be (0x)?\p{Hex_Digit}+.

However, as @d512 points out in their comment on another answer, this is still a bit naïve, and will also match hex numbers concatenated with non-hex strings. To avoid this, surround the expression with word boundary anchors like so: \b(0x)?\p{Hex_Digit}+\b.

You can see this in action here. Unfortunately, it appears JavaScript doesn't properly support fullwidth characters together with word boundaries, but Rust's main regex crate, and Python with the regex module, do.

Upvotes: 1

Michał Kawiecki

Reputation: 385

first, instead of ^ and $ use \b as this is a word delimiter and can help when the hash is not the only string in the line.

i came here looking for similar but specialized regex and came up with this:

\b(\d+[a-f]+\d+[\da-f]*|[a-f]+\d+[a-f]+[\da-f]*)\b

I needed to detect hashes like git commit identifiers (and similar) in console and more than matching all possible hashes i prioritize NOT matching random words or numbers like EB or 12345678

So a heuristic approach i made is that I assume a hash will be alternating between numbers and letters reasonably often and the chains of only numbers or only letters will be short.

Another important fact is that MD5 hash is 32 characters long (as mentioned by @Adaddinsane) and git displays a shortened version with only 10 characters, so above example can be modified as follows:

for 10-char long hashes i assume the groups will be at most 3-char long

\b(\d+[a-f]+\d+[\da-f]{1,7}|[a-f]+\d+[a-f]+[\da-f]{1,7})\b

for up to 32-char long hashes i assume the groups will be at most 5-char long

\b(\d+[a-f]+\d+[\da-f]{17,29}|[a-f]+\d+[a-f]+[\da-f]{17,29})\b

you can easily change a-f to a-fA-F for case insensitivity or add 0[xX] at the front for that 0x prefix matching

those examples will obviously not match exotic but valid hashes that have very long sequences of only numbers or only letters in the front or extreme hashes like only 0s but this way i can match hashes and reduce accidental false-positive matches significantly, like dir name or line number
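A small Python sketch of how the first pattern from this answer behaves on console-like text. The sample line and the names HASH_LIKE and found are invented for illustration.

```python
import re

# The first pattern from this answer: digits and letters must alternate
# at least twice at the start of the token.
HASH_LIKE = re.compile(r"\b(\d+[a-f]+\d+[\da-f]*|[a-f]+\d+[a-f]+[\da-f]*)\b")

text = "commit 3f2a91c changed line 12345678 in decade, see also a1b2c3d4e5"
found = [m.group(0) for m in HASH_LIKE.finditer(text)]
print(found)
```

The plain number 12345678 and the all-letter word decade are skipped, while both mixed tokens are picked up.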

Upvotes: 0

Sven

Reputation: 2553

In Java this is allowed:

(?:0x?)?[\p{XDigit}]+$

As you see the 0x is optional (even the x is optional) in a non-capturing group.

Upvotes: 1

Tommy Vasquez

Reputation: 71

Another example: Hexadecimal values for css colors start with a pound sign, or hash (#), then six characters that can either be a numeral or a letter between A and F, inclusive.

^#[0-9a-fA-F]{6}

Upvotes: 7

Paul Razvan Berg

Reputation: 21400

In case you need this within an input where the user can type 0 and 0x too but not a hex number without the 0x prefix:

^0?[xX]?[0-9a-fA-F]*$
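A quick Python check suggests that, as written, this pattern also accepts un-prefixed input such as ff, since both the 0 and the x are optional. If that is not wanted, one possible stricter variant is sketched below; it is an assumption about the intent and not part of the original answer, and the name STRICT is hypothetical.

```python
import re

ORIGINAL = re.compile(r"^0?[xX]?[0-9a-fA-F]*$")
# Hypothetical stricter variant: empty, "0", "0x", or "0x" plus hex digits.
STRICT = re.compile(r"^(?:0(?:[xX][0-9a-fA-F]*)?)?$")

samples = ["", "0", "0x", "0x1F", "ff", "x12"]
original_ok = [s for s in samples if ORIGINAL.match(s)]
strict_ok = [s for s in samples if STRICT.match(s)]
print(original_ok)
print(strict_ok)
```

The original accepts all six samples; the stricter variant rejects ff and x12.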

Upvotes: 0

Fábio Borges

Reputation: 41

If you are looking for a specific hex character in the middle of the string, you can use "\xhh", where hh is the character code in hexadecimal. I've tried it and it works. I use the Qt framework for C++, but it can solve problems in other cases too, depending on the flavor you need to use (PHP, JavaScript, Python, Golang, etc.).

This answer was taken from: http://ult-tex.net/info/perl/

Upvotes: 4

batspy

Reputation: 395

Just for the record I would specify the following:

/^[xX]?[0-9a-fA-F]{6}$/

This differs in that it requires exactly six valid hex characters and accepts an optional lowercase or uppercase x in front of them.

Upvotes: 6

joachim

Reputation: 30781

If you're using Perl or PHP, you can replace

[0-9a-fA-F]

with:

[[:xdigit:]]

Upvotes: 8

Local Needs

Reputation: 569

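This one matches exactly three valid pairs:

(([a-fA-F]|[0-9]){2}){3}

Fewer than three pairs of valid characters fail to match. To also reject input with more than three pairs, anchor the pattern with ^ and $, because unanchored it still matches the first three pairs of a longer string.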
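A small Python sketch of how this pattern behaves with and without anchoring. Here fullmatch plays the role of wrapping the pattern in ^ and $; the variable names are illustrative.

```python
import re

THREE_PAIRS = re.compile(r"(([a-fA-F]|[0-9]){2}){3}")

# search() finds the pattern anywhere, so longer input still matches.
found_in_longer = bool(THREE_PAIRS.search("aabbccdd"))

# fullmatch() requires the whole string to be exactly three pairs.
exact_ok = bool(THREE_PAIRS.fullmatch("aabbcc"))
too_long = bool(THREE_PAIRS.fullmatch("aabbccdd"))
too_short = bool(THREE_PAIRS.fullmatch("aabb"))

print(found_in_longer, exact_ok, too_long, too_short)
```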

Upvotes: 1

Adaddinsane

Reputation: 535

It's worth mentioning that detecting an MD5 (which is one of the examples) can be done with:

[0-9a-fA-F]{32}

Upvotes: 18

smathy

Reputation: 27961

Not a big deal, but many regex engines (Perl, PCRE/PHP and Ruby, for example) support the POSIX character classes, and there's [:xdigit:] for matching hex characters, which is simpler than the common 0-9a-fA-F stuff.

So, the regex as requested (ie. with optional 0x) is: /(0x)?[[:xdigit:]]+/

Upvotes: 34

Pawel Furmaniak

Reputation: 4806

This will match with or without 0x prefix

(?:0[xX])?[0-9a-fA-F]+

Upvotes: 12

SimonMayer

Reputation: 4916

The exact syntax depends on your exact requirements and programming language, but basically:

/[0-9a-fA-F]+/

or, more simply, with the i flag, which makes it case-insensitive:

/[0-9a-f]+/i

If you are lucky enough to be using Ruby, you can do:

/\h+/

EDIT - Steven Schroeder's answer made me realise my understanding of the 0x bit was wrong, so I've updated my suggestions accordingly. If you also want to match 0x, the equivalents are

/0[xX][0-9a-fA-F]+/
/0x[0-9a-f]+/i
/0x[\h]+/i

ADDED MORE - If 0x needs to be optional (as the question implies):

/(0x)?[0-9a-f]+/i
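As a rough check in Python, the optional-prefix pattern can be run against the three examples from the question. The variable names are illustrative, and re.IGNORECASE stands in for the /i flag.

```python
import re

OPTIONAL_PREFIX = re.compile(r"(0x)?[0-9a-f]+", re.IGNORECASE)

examples = ["0x0f4", "0acdadecf822eeff32aca5830e438cb54aa722e3", "8BADF00D"]
all_match = all(OPTIONAL_PREFIX.fullmatch(e) for e in examples)

# Inside running text, the unanchored pattern also picks up hex-looking
# fragments of ordinary words.
fragments = [m.group(0) for m in OPTIONAL_PREFIX.finditer("decode 0x0f4")]
print(all_match, fragments)
```

All three examples match as whole strings, but in running text the word decode contributes the fragments dec and de, so some form of boundary check is usually needed when scanning prose.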

Upvotes: 65

Steven Schroeder

Reputation: 6194

How about the following?

0[xX][0-9a-fA-F]+

Matches an expression starting with a 0, followed by either a lowercase or uppercase x, followed by one or more characters in the ranges 0-9, a-f, or A-F.

Upvotes: 297
