Reputation: 335
I've a database table I'd like to return all values where a column contains a non ASCII character anywhere in the string.
Is there an easy way to do this?
I've tried this
select col_name, regexp_instr(col_name,'[^[:ascii:]]')
from test_table s
where created > sysdate - 1
and regexp_instr(col_name,'[^[:ascii:]]') > 0
limit 5;
but get this error:
error: Invalid character class name, collating name, or character range. The error occured while parsing the regular expression: '[^[:>>>HERE>>>ascii:]]'.
code: 8002
context: T_regexp_init
query: 5059536
location: funcs_expr.cpp:130
process: query20_31 [pid=7903]
Thanks!
Upvotes: 5
Views: 13984
Reputation: 338
I was trying to accomplish something similar recently but @BigDataKid's solution (writing '[^\x00-\x7F]'
in the regex expression) won't work.
Usually, a backslash in combination with a literal character can create a regex token with a special meaning, in this case \x
represents "the character whose hexadecimal value is" where 00
and 7F
are the hex values.
While Postgres supports it (see 9.7.3.3. Regular Expression Escapes), it seems that Redshift's regex engine does not. You can check what exactly Redshift supports here.
The shortest and cleanest workaround I've found for this is:
SELECT column_name,
CASE
WHEN regexp_instr(column_name, '[^[:print:][:cntrl:]]') > 0 THEN TRUE
ELSE FALSE END AS has_non_ascii_char
FROM table_name
WHERE has_non_ascii_char;
Regex explanation:
The outer brackets [
, ]
(or a list if you will) is called "bracket expression", the caret (^
) means negation, so [^]
translates to "match any character not in the list".
In the list we have [:print:]
and [:cntrl:]
which are called "POSIX character class". [:print:]
refers to any printable ASCII character (including space) and [:cntrl:]
refers to all ASCII control characters (e.g. tab), together they should cover all ASCII characters.
Upvotes: 8
Reputation: 13
SELECT regexp_replace(column_name, '[^'||chr(32)||'-'||chr(126)||']','')
FROM table_name
WHERE regexp_count(column_name, '[^'||chr(32)||'- '||chr(126)||']')> 0
Perhaps it will be useful to someone and https://www.asciitable.com/
Upvotes: 0
Reputation: 1227
You can select the range or characters you wish to use by looking at this table: http://unicode-table.com/en/ .
Look at the code below, where I filter based on a specific character range.
dev=# create table tb1 ( id integer, col1 varchar(10) , col2 varchar(10) ) distkey(id);
CREATE TABLE
dev=# insert into tb1 values (1, 'ABC', 'DEF');
INSERT 0 1
dev=# insert into tb1 select 1, chr(638), chr(640);
INSERT 0 1
dev=# insert into tb1 select 4, concat(concat('AB', chr(638)),'CD') , chr(640);
INSERT 0 1
dev=# select * from tb1;
id | col1 | col2
----+-------+------
4 | ABɾCD | ʀ
1 | ABC | DEF
1 | ɾ | ʀ
(3 rows)
dev=# select id, col1, regexp_instr(col1,'[^\x00-\x7F]') , col2, regexp_instr(col2,'[^\x00-\x7F]') from tb1;
id | col1 | regexp_instr | col2 | regexp_instr
----+-------+--------------+------+--------------
4 | ABɾCD | 3 | ʀ | 1
1 | ABC | 0 | DEF | 0
1 | ɾ | 1 | ʀ | 1
(3 rows)
Upvotes: 0