Giorgos Myrianthous

Reputation: 39790

Removing special characters from dataframe rows

I've got a dataset like the one shown below:

! Hello World.  1
" Hi there. 0

I want to remove the special characters at the beginning of each row only; special characters anywhere else in the row should be left untouched.

In order to read the data (tab-separated) I use the following code:

val data = sparkSession.read.format("com.databricks.spark.csv")
    .option("delimiter", "\t")
    .load("data.txt")

val columns = Seq("text", "class")
val df = data.toDF(columns: _*)

I am aware that I should use replaceAll() but I am not quite sure how to do it.

Upvotes: 1

Views: 3503

Answers (2)

akuiper

Reputation: 214927

You can create a udf and apply it to the first column of your data frame to remove leading special characters:

val df = Seq(("! Hello World.", 1), ("\" Hi there.", 0)).toDF("text", "class")

df.show
+--------------+-----+
|          text|class|
+--------------+-----+
|! Hello World.|    1|
|   " Hi there.|    0|
+--------------+-----+    


import org.apache.spark.sql.functions.udf
// remove leading non-word characters from a string
def remove_leading: String => String = _.replaceAll("^\\W+", "")    
val udf_remove = udf(remove_leading)

df.withColumn("text", udf_remove($"text")).show
+------------+-----+
|        text|class|
+------------+-----+
|Hello World.|    1|
|   Hi there.|    0|
+------------+-----+
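
As a possible alternative to the udf, the same regex can probably be handed to Spark's built-in regexp_replace from org.apache.spark.sql.functions, which avoids wrapping a Scala function. This is only a sketch: the object name, the stripLeading helper, and the local SparkSession are assumptions made for illustration and are not part of the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

object StripLeadingSketch {
  // Hypothetical helper (not from the answer): removes leading non-word
  // characters from one string column with the built-in regexp_replace.
  def stripLeading(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, regexp_replace(col(column), "^\\W+", ""))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("strip-leading-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same sample rows as in the answer above.
    val df = Seq(("! Hello World.", 1), ("\" Hi there.", 0)).toDF("text", "class")
    stripLeading(df, "text").show()

    spark.stop()
  }
}
```

Note that \W in Java regexes means anything outside [a-zA-Z_0-9], so a leading underscore is kept and a leading non-ASCII letter is stripped. If that matters for your data, a pattern such as "^[^\\p{L}\\p{N}]+" may be a better fit.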

Upvotes: 1

FaigB

Reputation: 2281

Maybe this will help:

val str = " some string "
str.trim

Or trim a specific character from both ends:

str.stripPrefix(",").stripSuffix(",").trim

Or remove a set of characters from the front:

val ignorable = ", \t\r\n"
str.dropWhile(c => ignorable.indexOf(c) >= 0)
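
Applied to the question's data, the dropWhile idea could look roughly like the sketch below. It assumes "special character" means anything that is not a letter or a digit; the object and the stripLeadingSpecial helper are hypothetical names used only for this example.

```scala
object LeadingChars {
  // Hypothetical helper: drops characters from the front until the first
  // letter or digit is reached; the rest of the string is left untouched.
  def stripLeadingSpecial(s: String): String =
    s.dropWhile(c => !c.isLetterOrDigit)

  def main(args: Array[String]): Unit = {
    println(stripLeadingSpecial("! Hello World.")) // prints "Hello World."
    println(stripLeadingSpecial("\" Hi there."))   // prints "Hi there."
  }
}
```

Unlike the regex \W, isLetterOrDigit is Unicode-aware and treats a leading underscore as a character to drop, so the two approaches can differ on such inputs. To use it on a DataFrame column it would still need to be wrapped in a udf.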

All the useful ops with string can be found at

Upvotes: 1
