Xaloju
Xaloju

Reputation: 195

PL/SQL. Parse clob UTF8 chars with regexp_like regular expressions

I want to check if any line of my clob have strange characters like (ñ§). These characters are read from a csv-file with an unexpected encoding (UTF-8) which converts some of them.

I tried to filter each line using a regular expression but it's not working as intended. Is there a way to know the encoding of a csv-file when read?

How could I fix the regular expression to allow lines with only these characters? a-zA-Z 0-9 .,;:"'()-_& space tab.

Clob example readed from csv:

  l_clob clob :='
"exp","objc","objc","OBR","031110-5","S","EXAMPLE","NAME","08/03/2018",,"122","3","12,45"
 "xp","objc","obj","OBR","031300-5","S","EXAMPLE","NAME","08/03/2018",,"0","0","0"
';

Another clob:

DECLARE
    l_clob   CLOB
        := '"exp","objc","objc","OBR","031110-5","S","EXAMPLE","NAME","08/03/2018",,"122","3","12,45"
             "xp","objc","obj","OBR","031300-5","S","EXAMPLE","NAME","08/03/2018",,"0","0","0"';
    l_offset             PLS_INTEGER := 1;
    l_line               VARCHAR2 (32767);
    csvregexp   CONSTANT VARCHAR2 (1000)
        := '^([''"]+[-&\s(a-z0-9)]*[''"]+[,:;\t\s]?)?[''"]+[-&\s(a-z0-9)]*[''"]+' ;
    l_total_length       PLS_INTEGER := LENGTH (l_clob);
    l_line_length        PLS_INTEGER;
BEGIN

    WHILE l_offset <= l_total_length
    LOOP
        l_line_length := INSTR (l_clob, CHR (10), l_offset) - l_offset;

        IF l_line_length < 0
        THEN
            l_line_length := l_total_length + 1 - l_offset;
        END IF;

        l_line := SUBSTR (l_clob, l_offset, l_line_length);

        IF REGEXP_LIKE (l_line, csvregexp, 'i')
        THEN                                   -- i (case insensitive matches)
            DBMS_OUTPUT.put_line ('Ok');
            DBMS_OUTPUT.put_line (l_line);
        ELSE
            DBMS_OUTPUT.put_line ('Error');
            DBMS_OUTPUT.put_line (l_line);
        END IF;

        l_offset := l_offset + l_line_length + 1;
    END LOOP;
END;

Upvotes: 2

Views: 699

Answers (1)

kara
kara

Reputation: 3455

If you only want to allow special characters you can use this regex:

Your Regex

csvregexp   CONSTANT VARCHAR2 (1000) := '^[a-zA-Z 0-9 .,;:"''()-_&]+$' ;

Regex-Details

  • ^ Start of your string - no chars before this - prevents partial match
  • [] a set of allowed chars
  • []+ a set of allowed chars. Has to be one char minimum up to inf. (* instead of + would mean 0-inf.)
  • [a-zA-Z]+ 1 to inf. letters
  • [a-zA-Z0-9]+ 1 to inf. letters and numbers
  • $ end of your string - no chars behind this - prevents partial match

I think you can work it out with this ;-)

If you know there could be an other encoding in your input, you could try to convert and check against the regex again.

Example-convert

select convert('täst','us7ascii', 'utf8') from dual;

Upvotes: 1

Related Questions