Reputation: 335

How can I find any non ASCII characters in Redshift database

I've a database table I'd like to return all values where a column contains a non ASCII character anywhere in the string.

Is there an easy way to do this?

I've tried this

select col_name,  regexp_instr(col_name,'[^[:ascii:]]') 
from test_table s
where created > sysdate - 1 
and regexp_instr(col_name,'[^[:ascii:]]') > 0 
limit 5;

but get this error:

error:  Invalid character class name, collating name, or character range.  The error occured while parsing the regular expression: '[^[:>>>HERE>>>ascii:]]'.
  code:      8002
  context:   T_regexp_init
  query:     5059536
  location:  funcs_expr.cpp:130
  process:   query20_31 [pid=7903]

Thanks!

Upvotes: 5

Answers (3)

RiskyMaor

Reputation: 338

I was trying to accomplish something similar recently but @BigDataKid's solution (writing '[^\x00-\x7F]' in the regex expression) won't work.
Usually, a backslash in combination with a literal character can create a regex token with a special meaning, in this case \x represents "the character whose hexadecimal value is" where 00 and 7F are the hex values.
While Postgres supports it (see 9.7.3.3. Regular Expression Escapes), it seems that Redshift's regex engine does not. You can check what exactly Redshift supports here.

The shortest and cleanest workaround I've found for this is:

SELECT column_name,
       CASE
           WHEN regexp_instr(column_name, '[^[:print:][:cntrl:]]') > 0 THEN TRUE
           ELSE FALSE END AS has_non_ascii_char
FROM table_name
WHERE has_non_ascii_char;

Regex explanation:
The outer brackets [ , ] (or a list if you will) is called "bracket expression", the caret (^) means negation, so [^] translates to "match any character not in the list".
In the list we have [:print:] and [:cntrl:] which are called "POSIX character class". [:print:] refers to any printable ASCII character (including space) and [:cntrl:] refers to all ASCII control characters (e.g. tab), together they should cover all ASCII characters.

Upvotes: 8

Andrew B

Reputation: 13

SELECT regexp_replace(column_name, '[^'||chr(32)||'-'||chr(126)||']','') 
FROM table_name 
WHERE regexp_count(column_name, '[^'||chr(32)||'- '||chr(126)||']')> 0

Perhaps it will be useful to someone and https://www.asciitable.com/

Upvotes: 0

BigDataKid

Reputation: 1227

You can select the range or characters you wish to use by looking at this table: http://unicode-table.com/en/ .

Look at the code below, where I filter based on a specific character range.

dev=# create table tb1 ( id integer, col1 varchar(10) , col2 varchar(10) ) distkey(id);
CREATE TABLE

dev=# insert into tb1 values (1, 'ABC', 'DEF');
INSERT 0 1

dev=# insert into tb1 select 1, chr(638), chr(640);
INSERT 0 1
dev=# insert into tb1 select 4, concat(concat('AB', chr(638)),'CD') , chr(640);                                                                                  
INSERT 0 1

dev=# select * from tb1;
 id | col1  | col2 
----+-------+------
  4 | ABɾCD | ʀ
  1 | ABC   | DEF
  1 | ɾ     | ʀ
(3 rows)

dev=# select id, col1,  regexp_instr(col1,'[^\x00-\x7F]') , col2,  regexp_instr(col2,'[^\x00-\x7F]')  from tb1;
 id | col1  | regexp_instr | col2 | regexp_instr 
----+-------+--------------+------+--------------
  4 | ABɾCD |            3 | ʀ    |            1
  1 | ABC   |            0 | DEF  |            0
  1 | ɾ     |            1 | ʀ    |            1
(3 rows)

Upvotes: 0

How can I find any non ASCII characters in Redshift database

Answers (3)

Related Questions