How can these strings be different?

Question

I am facing a weird problem.

I have extracted data from an Excel file. It should contain an IBAN account number.

Then I tried to analyze the set of account numbers (which the source guarantees to be good) with a Java library.

To keep the scope of the question narrow: I can't explain why the two strings below are different.

03069
03069

The first is copied and pasted from the Excel file, the second is typed by hand. Google returns different results for abi [above number], and in fact only in the second case can I find that it is the bank code for Intesa Sanpaolo bank (exact page displaying the ABI code, localized, here).

So, to keep the scope narrow: how is that possible? Is it something to do with the encoding?

Try it yourself: press CTRL+F and type "030"; it will match both lines. Now type 6; it will match only the second line.

The same happened in Notepad++.

CodeCaster · Accepted Answer

There's a U+200B ZERO WIDTH SPACE between 030 and 69 in the first string.

To see it, paste the text into a tool such as https://www.branah.com/unicode-converter, or open it in a hex-capable editor.
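Since the question mentions Java, the same check can also be done in code. This is only a minimal sketch: the class name CodePointDump and the method describe are made up for illustration, and the pasted value is written with a \u200B escape on the assumption that this is the character Excel delivered.

```java
public class CodePointDump {

    // Hypothetical helper: lists every code point of the string as U+XXXX,
    // so invisible characters show up in the output.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: the pasted value, with the zero width space written as an escape.
        String pasted = "030\u200B69";
        String typed = "03069";

        System.out.println(describe(pasted)); // U+0030 U+0033 U+0030 U+200B U+0036 U+0039
        System.out.println(describe(typed));  // U+0030 U+0033 U+0030 U+0036 U+0039
        System.out.println(pasted.equals(typed)); // false
    }
}
```

The two strings look identical on screen, but the first has six code points and the second has five, which is why equals and CTRL+F treat them differently.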

One way to clean such strings is to whitelist characters: strip everything that isn't A-Z or 0-9.
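In Java, the whitelist idea could look roughly like this. It is a sketch, not a full IBAN validator: the class name IbanScrubber and the method scrub are hypothetical, and the sample IBAN-like value is invented for illustration.

```java
public class IbanScrubber {

    // Hypothetical helper: keeps only A-Z and 0-9 and drops everything else,
    // including zero width spaces, ordinary spaces and lowercase letters.
    static String scrub(String raw) {
        return raw.replaceAll("[^A-Z0-9]", "");
    }

    public static void main(String[] args) {
        String pasted = "030\u200B69"; // assumed input containing U+200B
        System.out.println(scrub(pasted));                 // 03069
        System.out.println(scrub(pasted).equals("03069")); // true

        // Invented IBAN-like sample with spaces and a zero width space.
        System.out.println(scrub("IT60 X054\u200B2811")); // IT60X0542811
    }
}
```

Because the whitelist also removes lowercase letters, input that may arrive in lowercase would need raw.toUpperCase() applied before scrubbing; whether that is appropriate depends on the data.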
