I have a PySpark DataFrame like the one below, where the special characters have been hex-encoded.
+--------------------+
|item_name           |
+--------------------+
|Jogador n\xBA 10    |
|Camisa N\xB0 9      |
|Uniforme M\xE9dio   |
+--------------------+
And I need the escape sequences decoded into the actual UTF-8 characters, like this:
+--------------------+
|item_name           |
+--------------------+
|Jogador nº 10       |
|Camisa N° 9         |
|Uniforme Médio      |
+--------------------+
PySpark's decode function makes no difference:
from pyspark.sql.functions import col, decode
df.withColumn('test', decode(col('item_name'), 'UTF-8')).show()
+--------------------+--------------------+
|item_name           |test                |
+--------------------+--------------------+
|Jogador n\xBA 10    |Jogador n\xBA 10    |
|Camisa N\xB0 9      |Camisa N\xB0 9      |
|Uniforme M\xE9dio   |Uniforme M\xE9dio   |
+--------------------+--------------------+
Answer:
PySpark will not decode correctly if the hex values are preceded by double backslashes (e.g. \\xBA instead of \xBA): the string then contains a literal backslash followed by 'xBA' rather than the encoded character.
Using take(3) instead of show() revealed that there was in fact a second backslash:
[Row(item_name='Jogador n\\xBA 10'),
Row(item_name='Camisa N\\xB0 9'),
Row(item_name='Uniforme M\\xE9dio')]
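In plain Python the difference is easy to see (an illustrative snippet, not part of the original data pipeline):
s = 'Jogador n\\xBA 10'                       # what the column really holds
print(len(s))                                 # 16 -- the backslash is a literal character
print(s.encode().decode('unicode-escape'))    # Jogador nº 10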
To solve this, I created a UDF that decodes using the 'unicode-escape' codec:
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Re-encode to bytes, then let the 'unicode-escape' codec turn '\xBA' into 'º';
# the None check keeps null values from raising an AttributeError
my_udf = F.udf(lambda x: x.encode().decode('unicode-escape') if x is not None else None, T.StringType())

df.withColumn('test', my_udf('item_name')).show()
+------------------+---------------+
|         item_name|           test|
+------------------+---------------+
|  Jogador n\xBA 10|  Jogador nº 10|
|    Camisa N\xB0 9|    Camisa N° 9|
| Uniforme M\xE9dio| Uniforme Médio|
+------------------+---------------+
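One caveat: 'unicode-escape' decodes the bytes it receives as Latin-1, so this round trip is only safe when the column is plain ASCII plus escape sequences; genuinely non-ASCII characters already in the string would get mangled. On larger data, the same logic can also be vectorized with a pandas UDF. This is only a sketch, assuming Spark 3.x with pyarrow available; unescape is a name I made up:
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.pandas_udf(T.StringType())
def unescape(s: pd.Series) -> pd.Series:
    # Same encode/decode round trip, applied per batch instead of per row
    return s.map(lambda x: x.encode().decode('unicode-escape') if x is not None else None)

df.withColumn('test', unescape('item_name')).show()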