Reputation: 2173
We have an issue where one of the producers is pushing Unicode strings into a field that should be ASCII. The job is currently configured with pure SQL, so I would like to know whether it is possible to convert a Unicode string to ASCII using just Spark SQL, similar to the solution given in this question (this will of course lose data for unsupported characters, but that is not a concern).
Upvotes: 0
Views: 4782
Reputation: 8711
You can remove the unwanted characters using regexp_replace():
scala> spark.sql(""" SELECT regexp_replace(decode(encode('ÄÊÍABCDE', 'utf-8'), 'ascii'), "[^\t\n\r\x20-\x7F]","") x """).show(false)
+-----+
|x |
+-----+
|ABCDE|
+-----+
scala>
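Since your job is configured with pure SQL, the same expression can be run directly from the spark-sql shell. A minimal sketch, assuming a table my_table with a string column src_col (both names are placeholders); note the doubled backslashes, since with Spark's default parser SQL string literals process escape sequences before the pattern reaches the regex engine:
SELECT regexp_replace(
         decode(encode(src_col, 'utf-8'), 'ascii'),  -- non-ASCII bytes decode to replacement chars
         '[^\\t\\n\\r\\x20-\\x7F]',                  -- keep only tab/newline/CR and printable ASCII
         ''
       ) AS ascii_col
FROM my_table;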
Upvotes: 1
Reputation: 42392
Try encode:
SELECT encode(column, 'ascii') as column;
for example:
spark-sql> select encode('ÄÊÍABCDE', 'ascii');
???ABCDE
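Note that encode() returns a BINARY column (the CLI simply prints the bytes, with characters the ASCII encoder cannot map replaced by ?). If the target field must remain a string, the bytes can be decoded back; a sketch using the same literal:
spark-sql> select decode(encode('ÄÊÍABCDE', 'ascii'), 'ascii');
???ABCDE
Since the ? replacement bytes are themselves valid ASCII, this round-trip yields a plain STRING with the unsupported characters already substituted.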
Upvotes: 2