Reputation: 5275
I have a dataset with some corrupted data - a string column has some strings containing \u0000
. I need to filter out all of them, and the only thing I have at my disposal is the where
clause.
I tried WHERE field NOT LIKE concat('%', chr(00), '%')
, but my hive distro (which is AWS EMR) doesn't recognize chr()
. Is there another option for filling out my where
clause to filter out fields containing \u0000
, without using chr()
?
Upvotes: 4
Views: 651
Reputation: 2838
You could try as follow
SELECT '\u0000' AS text;
+-------+--+
| text |
+-------+--+
| |
+-------+--+
-- NOT EMPTY
SELECT '\u0000abc' AS text;
+-------+--+
| text |
+-------+--+
| abc |
+-------+--+
-- NOT EMPTY
so
SELECT text
FROM(SELECT '\u0000abc' AS text) AS t
WHERE text NOT LIKE('\u0000%');
+-------+--+
| text |
+-------+--+
+-------+--+
-- EMPTY
SELECT text
FROM(SELECT '\u0000abc' AS text) AS t
WHERE text LIKE('\u0000%');
+-------+--+
| text |
+-------+--+
| abc |
+-------+--+
-- NOT EMPTY
Upvotes: 1