Identify hidden non-UTF8 encoded characters

Question

I am working in a PostgreSQL database and have a text column containing text in various languages, such as Russian, Chinese, Korean, and English. Although our application handles these languages well, we are having an issue dealing with non-UTF-8 characters.

For example, in the Notepad++ screenshot below, where I applied Encoding > Encode in UTF-8, all the unrecognizable characters are clearly shown.

However, we are having trouble marking such records as non-processable in Postgres. A simple flag would do. My attempt (below) also flags the valid Russian records, whereas Notepad++ explicitly shows only the hidden/non-UTF-8 characters.

Notepad++

The weird thing about these characters is that they do not show up in a regular SELECT query, but when I convert the text to "UTF-8" they show up as below.

Database

I tried the query below, but it does not give the desired output. The expectation is to set a flag on records that contain invalid hidden HTML references without losing valid text, such as the valid Russian sentence in the snapshot. I should be able to distinctly identify only such texts.

select text,
       text ~ '[^[:ascii:]]' as has_non_ascii,
       text ~ '^[\x00-\x7F]*$' as is_all_ascii
from sample_data;
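Both tests above only separate ASCII from non-ASCII, so any Cyrillic, Chinese, or Korean text is flagged along with the hidden characters. A possible alternative, sketched here under assumptions, is to match only the characters Notepad++ typically renders as boxed abbreviations. The sketch assumes the database encoding is UTF8 and that the hidden characters are C0/C1 control characters (U+0001-U+001F apart from tab, newline, and carriage return, plus U+007F-U+009F) or the replacement character U+FFFD. Neither assumption is confirmed by the data shown, and the alias names are hypothetical.

```sql
-- Diagnostic: list the distinct non-ASCII characters and their code points,
-- to check which ones are actually the hidden characters.
-- With a UTF8 database encoding, ascii() returns the Unicode code point.
select distinct ch, ascii(ch) as code_point
from sample_data,
     regexp_split_to_table(text, '') as ch
where ascii(ch) > 127
order by code_point;

-- Flag: true only when the text contains a control character or U+FFFD.
-- Assumption: these ranges cover the hidden characters; adjust the bracket
-- expression to whatever the diagnostic query above reports.
select text,
       text ~ '[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F\uFFFD]' as has_hidden_chars
from sample_data;
```

If the diagnostic query shows the hidden characters fall outside these ranges, only the bracket expression should need to change; valid Russian text would not match it, since Cyrillic letters sit in U+0400-U+04FF.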

Sample data:

"Я не наркоман. Это у меня всегда, когда мне афигитительно. А если серьёзно, это интересно,…"

"Ya le dieron amor a la foto de instagram de mi #UberCALAVERITA?"

"Executive Admininstrative Assistant in Toronto, ON for a Group"

"Сегодня валютные стратеги BMO обновили прогнозы по основным валютам на ближайшие пять кварталов (на конец периода): читать далее…"

"Flicitations Gestion d'actifs pour 6 Trophes #FundGradeA+2016 de fonds communs de placement :"
