Pavel Prochazka

Reputation: 771

How to remove diacritics in pyspark dataframes?

I am wondering how to remove diacritics in a PySpark DataFrame with Python 2. I would need something like

from pyspark.sql.session import SparkSession
from pyspark import SparkContext
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType
import unidecode

df = sc.parallelize([(u'pádlo', 1), (u'dřez', 4)]).toDF(['text', 'num'])

def remove_diacritics(s):
    return unidecode.unidecode(s)

rem_udf = sf.udf(remove_diacritics, StringType())

df.select(rem_udf('text'))

Unfortunately, the unidecode module is not available on our cluster.

Is there some natural solution that I am missing, other than manually replacing all possible characters? Note that the expected result is [padlo, drez].
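One stdlib-only alternative worth noting (assuming the standard unicodedata module is available on the cluster, which it normally is): NFKD normalization decomposes accented letters into a base letter plus combining marks, which can then be dropped. A sketch of such a UDF body in plain Python:

```python
import unicodedata

def remove_diacritics(s):
    # Decompose accented characters (NFKD), then drop the combining marks.
    nfkd = unicodedata.normalize('NFKD', s)
    return u''.join(c for c in nfkd if not unicodedata.combining(c))

print(remove_diacritics(u'pádlo'))  # padlo
print(remove_diacritics(u'dřez'))   # drez
```

This would be wrapped in sf.udf(..., StringType()) exactly like the unidecode version above; note it only strips combining marks and does not transliterate characters such as 'ø' or 'ß' the way unidecode does.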

Upvotes: 1

Views: 1705

Answers (1)

Tetlanesh

Reputation: 403

You can use the analog of SQL's translate to replace characters based on two "dictionaries":

from pyspark.sql.session import SparkSession
from pyspark import SparkContext
import pyspark.sql.functions as sf

charsFrom = 'řá'  # fill this string with all diacritics
charsTo   = 'ra'  # and the corresponding Latin characters

df = sc.parallelize([(u'pádlo', 1), (u'dřez', 4)]).toDF(['text', 'num'])
df = df.select(sf.translate('text', charsFrom, charsTo), 'num')

It will replace every occurrence of each character in the first string with the corresponding character from the second string.
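For reference, the per-character substitution that sf.translate performs can be sketched in plain Python without Spark. The two strings here are the same illustrative pair as above, not a complete diacritics table; the dict-of-ordinals form works with unicode.translate on Python 2 as well as str.translate on Python 3:

```python
# Per-character substitution equivalent to sf.translate, outside Spark.
# These strings cover only the example characters; a complete mapping
# would have to list every diacritic expected in the data.
chars_from = u'řá'
chars_to = u'ra'

# Map each source character's ordinal to its replacement character.
table = dict(zip([ord(c) for c in chars_from], chars_to))

print(u'pádlo'.translate(table))  # padlo
print(u'dřez'.translate(table))   # drez
```

The main caveat of the translate approach is that any diacritic missing from charsFrom passes through unchanged, so the mapping strings must be exhaustive for the input language.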

Upvotes: 3
