Best way to remove invisible Unicode characters from text using Power Query in Excel

Question

I have two tables of data downloaded from an Oracle Cloud financial system and imported into Microsoft Excel.

Both had Invoice Number fields but, when I used Power Query to transform them into compatible formats and then merge the transformed tables, I found that the Merge process could not match seemingly identical invoice numbers from the two tables.

The obvious first thing to do was to create a custom column by using Text.Trim() to remove leading and trailing spaces from the invoice numbers. It turned out spaces were present in one of the Invoice Number fields, but the Merge process still did not match after I removed them with Text.Trim().

I then compared visibly identical trimmed invoice numbers using the = operator in an Excel cell, but Excel produced the result TRUE -- despite the Power Query Merge still not matching any of the invoice numbers!

After exploring many potential reasons for why the match might fail when the invoice numbers were identical to my eye and to Excel's = operator, it finally occurred to me to check that the invoice numbers were identical from a Unicode perspective.

I therefore used the UNICODE() function in Excel to check the code points for each character in the invoice numbers from the different source spreadsheets. I found that the invoice numbers from one spreadsheet each included the code point 8237 before the number and the code point 8236 after the number, neither of which is a visible character. This point was previously mentioned in Hidden Character | MrExcel Message Board.

These two characters are explained in Shapecatcher.com: Unicode character information for 'Left-to-right override' (0x202D) and Shapecatcher.com: Unicode character information for 'Pop directional formatting' (0x202C).

To remove these characters from the invoice numbers, I decided to use the following Power Query M code for the time being:

Text.Select([Invoice Number],{" ".."~"})

... which selects only the characters in the Invoice Number that lie between space and tilde, and my Merge now works.

That's fine for this particular situation. Indeed, since the invoice numbers only include digits in this case, I could have used the simpler Text.Select([Invoice Number],{"0".."9"}).

However, I would ideally like to find a more general expression that:

removes all invisible characters that might prevent a match (I know about these two, but I bet there are others that I don't yet know about!), and
doesn't remove (e.g.) characters with accents and diacritical marks that might appear in fields containing people's names that I might try to merge on in the future.

How do I remove invisible unicode characters from the beginning and the end of a string? asked a similar question in the context of C# and there is an accepted answer, but that answer relied on accessing Unicode character properties through a regular expression, and I'm pretty sure I can't either use regular expressions or access Unicode character properties in Power Query?

Is there a better alternative than trying to find all the Unicode characters that have the properties shown in the accepted answer to How do I remove invisible unicode characters from the beginning and the end of a string? and including them in a call to Text.Remove() in Power Query?

Alternatively, if I have to find a list of all the Unicode characters that have those properties, is there a straightforward way to do that?

Sam Nseir · Accepted Answer

Here is an idea:

Text.Trim( Text.Remove([Invoice Number], { Character.FromNumber(8192)..Character.FromNumber(8303) }) )

This will strip all the Unicode characters in this range: U+2000 - U+206F

Also have a look at: https://invisible-characters.com/ if you want to get all of these characters. However, I feel the U+2000 - U+206F range will suffice so long as you don't mind dropping all the visible characters in this range as well.

Best way to remove invisible Unicode characters from text using Power Query in Excel

Answers (2)

Related Questions