Maviles

Reputation: 3469

How to decode strings that have special UTF-8 characters hex encoded in a pyspark dataframe

I have a PySpark DataFrame like the one below, where the special characters have been hex-encoded.

+--------------------+
|item_name           |
+--------------------+
|Jogador n\xBA 10    |
|Camisa N\xB0 9      |
|Uniforme M\xE9dio   |
+--------------------+

And I need to decode it to UTF-8 chars, like this:

+--------------------+
|item_name           |
+--------------------+
|Jogador nº 10       |
|Camisa N° 9         |
|Uniforme Médio      |
+--------------------+

PySpark's decode function makes no difference:

df.withColumn('test', decode(col('item_name'),'UTF-8')).show()

+--------------------+--------------------+
|item_name           |test                |
+--------------------+--------------------+
|Jogador n\xBA 10    |Jogador n\xBA 10    |
|Camisa N\xB0 9      |Camisa N\xB0 9      |
|Uniforme M\xE9dio   |Uniforme M\xE9dio   |
+--------------------+--------------------+
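For context, this is presumably a no-op because the column already holds plain text in which the backslash, x, B, A are four ordinary characters, so there are no encoded bytes for a UTF-8 decode to change (a plain-Python sketch, assuming the rows are stored as literal escape sequences):

```python
# What the DataFrame cell presumably stores: a literal backslash escape.
s = "Jogador n\\xBA 10"

# The escape is four separate characters, not one encoded byte...
print(list(s[9:13]))   # ['\\', 'x', 'B', 'A']

# ...so a UTF-8 round trip of its bytes just reproduces the same string.
print(s.encode("utf-8").decode("utf-8") == s)  # True
```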

Upvotes: 0

Views: 12447

Answers (1)

Maviles

Reputation: 3469

PySpark will not decode correctly if the hex values are preceded by double backslashes (e.g. \\xBA instead of \xBA), because the column then holds a literal backslash followed by the text "xBA" rather than an encoded byte.

Using "take(3)" instead of "show()" revealed that there was in fact a second backslash:

[Row(item_name='Jogador n\\xBA 10'),
 Row(item_name='Camisa N\\xB0 9'),
 Row(item_name='Uniforme M\\xE9dio')]

To solve this I created a UDF that decodes using the "unicode-escape" codec:

import pyspark.sql.functions as F
import pyspark.sql.types as T

# Interpret the literal \xNN escapes in each string via the unicode-escape codec
my_udf = F.udf(lambda x: x.encode().decode('unicode-escape'), T.StringType())
df.withColumn('test', my_udf('item_name')).show()
+------------------+---------------+
|         item_name|           test|
+------------------+---------------+
|  Jogador n\xBA 10|  Jogador nº 10|
|    Camisa N\xB0 9|    Camisa N° 9|
| Uniforme M\xE9dio| Uniforme Médio|
+------------------+---------------+
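The transformation inside the UDF can be checked in plain Python without Spark (a minimal sketch; the sample string mirrors the rows above):

```python
# The stored value contains a literal backslash escape, not an encoded byte.
raw = "Jogador n\\xBA 10"

# Round-tripping through bytes lets unicode-escape turn \xBA into U+00BA (º).
decoded = raw.encode("ascii").decode("unicode-escape")
print(decoded)  # Jogador nº 10
```

One caveat worth noting: unicode-escape maps each \xNN to the Latin-1 code point NN, which is exactly right for characters like º, ° and é here, but escapes that represent multi-byte UTF-8 sequences (e.g. \xC3\xA9) would need a further `.encode('latin-1').decode('utf-8')` step.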

Upvotes: 3
