Reputation: 771
I am wondering how to remove diacritics from a PySpark DataFrame column with Python 2. I would need something like this:
import unidecode

import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

# sc is an existing SparkContext (e.g. from the pyspark shell)
df = sc.parallelize([(u'pádlo', 1), (u'dřez', 4)]).toDF(['text', 'num'])

def remove_diacritics(s):
    return unidecode.unidecode(s)

rem_udf = sf.udf(remove_diacritics, StringType())
df.select(rem_udf('text'))
Unfortunately, the unidecode module is not available on our cluster.
Is there any natural solution that I am missing, other than manually replacing every possible character? Note that the expected result is [padlo, drez].
Upvotes: 1
Views: 1705
Reputation: 403
You can use an analog of SQL's translate to replace characters based on two "dictionaries":
import pyspark.sql.functions as sf

charsFrom = u'řá'  # fill this string with all the diacritics in your data
charsTo = u'ra'    # and the corresponding plain Latin characters

# sc is an existing SparkContext (e.g. from the pyspark shell)
df = sc.parallelize([(u'pádlo', 1), (u'dřez', 4)]).toDF(['text', 'num'])
df = df.select(sf.translate('text', charsFrom, charsTo), 'num')
It will replace every occurrence of each character in the first string with the corresponding character from the second string.
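If you don't want to fill the two strings by hand, here is a minimal sketch (not part of the answer above) that derives the Latin counterparts with the standard-library unicodedata module, so nothing extra has to be installed on the cluster; the list of accented characters below is only an illustrative assumption, extend it to whatever appears in your data:

import unicodedata

import pyspark.sql.functions as sf

# Illustrative sample of accented characters; extend as needed.
charsFrom = u'áčďéěíňóřšťúůýž'
# NFKD decomposes e.g. u'ř' into u'r' plus a combining caron; keeping
# only the first code point yields the base Latin letter.
charsTo = u''.join(unicodedata.normalize('NFKD', c)[0] for c in charsFrom)

result = df.select(
    sf.translate('text', charsFrom, charsTo).alias('text'),  # alias keeps the column name readable
    'num',
)
result.show()  # rows become (u'padlo', 1) and (u'drez', 4), as expected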
Upvotes: 3