Reputation: 1702
I want to normalize author names by removing accents.
Input: orčpžsíáýd
Output: orcpzsiayd
The code below lets me achieve this for a plain string. However, I am not sure how I can do this using Spark functions when my input is a DataFrame column.
def stringNormalizer(c: Column) = {
  import org.apache.commons.lang.StringUtils
  StringUtils.stripAccents(c.toString)
}
This is how I want to be able to call it:
val normalizedAuthor = flat_author.withColumn("NormalizedAuthor",
  stringNormalizer(df_article("authors")))
I have just started learning Spark, so please let me know if there is a better way to achieve this without UDFs.
Upvotes: 0
Views: 639
Reputation: 11
Although it doesn't look as pretty, I found that it took half the time to remove accents this way, without a UDF:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, regexp_replace, upper}

def withColumnFormated(columnName: String)(df: DataFrame): DataFrame = {
  val dfWithColumnUpper = df.withColumn(columnName, upper(col(columnName)))
  val accents: Map[String, String] = Map(
    "[ÃÁÀÂÄ]" -> "A", "[ÉÈÊË]" -> "E", "[ÍÌÎÏ]" -> "I",
    "[Ñ]" -> "N", "[ÓÒÔÕÖ]" -> "O", "[ÚÙÛÜ]" -> "U",
    "[Ç]" -> "C")
  accents.foldLeft(dfWithColumnUpper) {
    case (tempDf, (pattern, replacement)) =>
      tempDf.withColumn(columnName,
        regexp_replace(col(columnName), lit(pattern), lit(replacement)))
  }
}
And then you can apply it like this:
df_article.transform(withColumnFormated("authors"))
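The per-row effect of the foldLeft can be seen without Spark at all. Below is a minimal plain-Scala sketch of the same uppercase-then-replace logic (the sample name `"José Álvarez"` is an illustrative input, not from the question). Note that the map only covers the characters listed in it, so characters such as č or ž from the question's example input would need their own entries:

```scala
// Same replacement map and foldLeft shape as the DataFrame version,
// applied to a single String instead of a column.
object AccentDemo {
  val accents: Map[String, String] = Map(
    "[ÃÁÀÂÄ]" -> "A", "[ÉÈÊË]" -> "E", "[ÍÌÎÏ]" -> "I",
    "[Ñ]" -> "N", "[ÓÒÔÕÖ]" -> "O", "[ÚÙÛÜ]" -> "U",
    "[Ç]" -> "C")

  // Uppercase first, then apply each regex -> replacement pair in turn.
  def normalize(s: String): String =
    accents.foldLeft(s.toUpperCase) { case (acc, (pattern, replacement)) =>
      acc.replaceAll(pattern, replacement)
    }
}

println(AccentDemo.normalize("José Álvarez"))  // JOSE ALVAREZ
```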
Upvotes: 1
Reputation:
It requires a UDF:
import org.apache.commons.lang.StringUtils
import org.apache.spark.sql.functions.{col, udf}

val stringNormalizer = udf((s: String) => StringUtils.stripAccents(s))
df_article.select(stringNormalizer(col("authors")))
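For reference, the effect of `StringUtils.stripAccents` can be reproduced with only the JDK, via Unicode NFD decomposition followed by stripping combining marks. A minimal plain-Scala sketch, handy if commons-lang is not on the classpath:

```scala
import java.text.Normalizer

object StripAccents {
  // Decompose accented characters into a base character plus a combining
  // mark (NFD), then remove the combining marks (\p{M}).
  def apply(s: String): String =
    Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "")
}

println(StripAccents("orčpžsíáýd"))  // orcpzsiayd
```

This function could be wrapped in a `udf` exactly like the commons-lang version above.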
Upvotes: 1