Uthpala Dl
Uthpala Dl

Reputation: 45

PLSQL: Find invalid characters in a database column (UTF-8)

I have a text column in a table which I need to validate to recognize which records have non UTF-8 characters. Below is an example record where there are invalid characters.

text = 'PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, monta􀄪 tablic i uchwytów na r􀄊czniki, wymiana zamka systemowego'

There are over 3 million records in this table, so I need to validate them all at once and get the rows where this text column has non UTF-8 characters.

I tried below:

Is there anything else I can do? Appreciate your input.

This is Oracle 11.2

Upvotes: 1

Views: 4438

Answers (2)

Jabrwoky
Jabrwoky

Reputation: 61

As shown by previous comments, you can detect the issue in place, but it's difficult to automatically correct in place.
I have used https://pypi.org/project/ftfy/ to correct invalidly encoded characters in large files. It guesses what the actual UTF8 character should be, and there are some controls on how it does this. For you, the problem is that you have to pull the data out, fix it, and put it back in. So assuming you can get the data out to the file system to fix it, you can locate files with bad encodings with something like this:

find . -type f | xargs -I {} bash -c "iconv -f utf-8 -t utf-16 {} &>/dev/null || echo {}"

This produces a list of files that potentially need to be processed by ftfy.

Upvotes: 0

Koen Lostrie
Koen Lostrie

Reputation: 18695

The characters you're seeing might be invalid for your data, but they are valid AL32UTF8 characters. Else they would not be displayed correctly. It's up to you to determine what character set contains the correct set of characters.

For example, to check if a string only contains characters in the US7ASCII character set, use the CONVERT function. Any character that cannot be converted into a valid US7ASCII character will be displayed as ?.

The example below first replaces the question marks with string '~~~~~', then converts and then checks for the existence of a question mark in the converted text.

WITH t (c) AS
(SELECT 'PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, monta􀄪 tablic i uchwytów na r􀄊czniki, wymiana zamka systemowego' FROM DUAL UNION ALL
  SELECT 'Just a bit of normal text' FROM DUAL UNION ALL
  SELECT 'Question mark ?' FROM DUAL),
converted_t (c) AS
(
SELECT 
  CONVERT(
  REPLACE(c,'?','~~~~~')
  ,'US7ASCII','AL32UTF8')
  FROM t
)
SELECT CASE WHEN INSTR(c,'?') > 0 THEN 'Invalid' ELSE 'Valid' END as status, c
FROM converted_t
; 

Invalid 
PP632485 - Hala A - prace kuchnia Zepelin, wymiana muszli, montao??? tablic i uchwyt??w na ro??Sczniki, wymiana zamka systemowego                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          

Valid   
Just a bit of normal text                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  

Valid   
Question mark ~~~~~

Again, this is just an example - you might need a less restrictive character set.

--UPDATE--

With your data: it's up to you to determine how you want to continue. Determine what is a good target data set. Contrary to what I set earlier, it's not mandatory to pass a "from dataset" argument in the CONVERT function.

Things you could try:

  1. Check which characters show up as '�' when converting from UTF8 at AL32UTF8
select * from G2178009_2020030114_dinllk
WHERE INSTR(CONVERT(text  ,'AL32UTF8','UTF8'),'�') > 0;
  1. Check if the converted text matches the original text. In this example I'm converting to UTF8 and comparing against the original text. If it is different then the converted text will not be the same as the original text.
select * from G2178009_2020030114_dinllk
WHERE
CONVERT(text  ,'UTF8') = text;

This should be enough tools for you to diagnose your data issue.

Upvotes: 1

Related Questions