Reputation: 39790
I've got a dataset like the one shown below:
! Hello World. 1
" Hi there. 0
What I want to do is remove all the special characters from the beginning of each row (only from the beginning, not any special characters elsewhere).
In order to read the (tab-separated) data I use the following code:
val data = sparkSession.read.format("com.databricks.spark.csv")
.option("delimiter", "\t")
.load("data.txt")
val columns = Seq("text", "class")
val df = data.toDF(columns: _*)
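As a side note, on Spark 2.x the same tab-separated file can be read with the built-in CSV source instead of the databricks package; a minimal sketch (data2/df2 are illustrative names, assuming sparkSession is an active SparkSession):
val data2 = sparkSession.read
  .option("sep", "\t")   // tab-separated input
  .csv("data.txt")
val df2 = data2.toDF("text", "class")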
I am aware that I should use replaceAll(), but I am not quite sure how to do it.
Upvotes: 1
Views: 3503
Reputation: 214927
You can create a udf
and apply it to the first column of your data frame to remove leading special characters:
val df = Seq(("! Hello World.", 1), ("\" Hi there.", 0)).toDF("text", "class")
df.show
+--------------+-----+
| text|class|
+--------------+-----+
|! Hello World.| 1|
| " Hi there.| 0|
+--------------+-----+
import org.apache.spark.sql.functions.udf
// remove leading non-word characters from a string
def remove_leading: String => String = _.replaceAll("^\\W+", "")
val udf_remove = udf(remove_leading)
df.withColumn("text", udf_remove($"text")).show
+------------+-----+
| text|class|
+------------+-----+
|Hello World.| 1|
| Hi there.| 0|
+------------+-----+
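For completeness, the same cleanup can also be done without a udf by using Spark's built-in regexp_replace function; a minimal sketch on the same df:
import org.apache.spark.sql.functions.regexp_replace

// strip leading non-word characters with the built-in regex replace
df.withColumn("text", regexp_replace($"text", "^\\W+", "")).show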
Upvotes: 1
Reputation: 2281
Maybe this will help:
val str = " some string "
str.trim
or strip a specific prefix/suffix character and trim:
str.stripPrefix(",").stripSuffix(",").trim
or drop specific characters from the front:
val ignorable = ", \t\r\n"
str.dropWhile(c => ignorable.indexOf(c) >= 0)
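To apply this to the data frame from the question, the same dropWhile logic can be wrapped in a udf; a minimal sketch, with the ignorable set widened (an assumption) to cover the characters in the sample data:
import org.apache.spark.sql.functions.udf

// characters to drop from the front; extended for the sample data (assumption)
val ignorable = "!\", \t\r\n"
val udf_strip = udf((s: String) => s.dropWhile(c => ignorable.indexOf(c) >= 0))
df.withColumn("text", udf_strip($"text")).show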
All the useful string operations can be found in the Scala StringOps documentation.
Upvotes: 1