Kausty
Kausty

Reputation: 859

Identify hidden non-UTF8 encoded characters

I am working in postgreSQL database and I have text column which in various languages like russian, chineses, korean, english etc. Although our application handles these languages well, we are having a issue dealing with non-UTF-8 characters.

For example, if you see the image from notepad++ where I have done Encoding > Encode in UTF-8, it neatly shows all the non-recognizable characters.

However, we are facing issue marking such records as non-process-able in postgres. Something like a flag should also do but I am trying something like below but it flags the valid russian records as well whereas notepad++ explicitly shows the hidden/non-UTF-8 characters.

Notepad++ notepad++ UTF-8

Weird thing about these characters are that they do not show up regular select query but when I convert them to "UTF-8", those show up like below.

Database enter image description here

Tried something like this (below query) but it does not seem to work i.e give me the desired output. Expectation is to set a flag to such records which have invalid hidden HTML references but not lose the valid text like the valid russian sentence in the snapshot. Should be able to distinctly identify only such texts.

select text, text ~ '[^[:ascii:]]', text ~ '^[\x00-\x7F]*$' 
from sample_data;

Sample Data -

"Я не наркоман. Это у меня всегда, когда мне афигитительно. А если серьёзно, это интересно,…"

"Ya le dieron amor a la foto de instagram de mi #UberCALAVERITA?"

"Executive Admininstrative Assistant in Toronto, ON for a Group"

"Сегодня валютные стратеги BMO обновили прогнозы по основным валютам на ближайшие пять кварталов (на конец периода): читать далее…"

"Flicitations Gestion d'actifs pour 6 Trophes #FundGradeA+2016 de fonds communs de placement :"

Upvotes: 1

Views: 5591

Answers (1)

Tom Blodget
Tom Blodget

Reputation: 20812

This answer might help you go back to fix problems. It doesn't directly help you to go forward in the direction you are asking about.

Looking at Flicitations and F\302\202licitations, the escapes look like octal, which is possibly a presentation choice of your "IDE" and/or the convert_to function. From octal, \302\202 is 0xC2 0x82, decoding as UTF-8 gives U+0082. In Unicode, that's a control character, in ISO 8859-1 it's a non-character, either might explain why some renderings make it invisible or take no space.

Now, Google tells me that Flicitations is almost like a French word, Félicitations. So, perhaps there is a character set and encoding where é is encoded as 0x82. Wikipedia helps here—Indeed there is: IBM850, which has been used for some French text.

So, it seems that someone has mishandled the user's text, causing data loss. The fundamental rule of text encoding is that text bytes must be read with the same encoding they were written with. Don't guess; Ask, or reference a standard, specification, documentation, or convention. Maybe you can go back and find the misbehaving process/code—at least that would prevent future data loss.


"Dealing with non-UTF-8 characters": There aren't really any non-UTF-8 characters. UTF-8 is an encoding of the Unicode character set. There are areas with exceptions but, practically speaking, Unicode has all characters, and UTF-8 can encode them all. So, if you think there are non-UTF-8 characters, the writer is either non-compliant or the reader is using the wrong encoding.

Upvotes: 1

Related Questions