Reputation: 1330
I have a DataFrame
and i have applied different functions to replace trailing space but there is no luck
df.select(col("e_no"),regexp_replace(col("e_no"),"//s+$",""),rtrim(col("e_no")),length(col("e_no"))).show()
e_no | regexp_replace(e_no),//s+$, )| rtrim(e_no)| length(e_no)
525071 | 525071 | 525071 | 7
512938| 512938| 512938| 6
522783 | 522783 | 522783 | 7
please could you advise.
Upvotes: 1
Views: 4894
Reputation: 7928
With withColumn
method is the easiest way for my understanding:
import org.apache.spark.sql.functions._
val newDF = rawDF.withColumn("e_no", rtrim(col("e_no")))
newDF.show()
Also you can use an UDF, but it may be overkilling the issue. Anyway is useful to get used to use UDFs, they are pretty helpful
val removeTrailingWhitespaces = (s:String) => {
s.trim
}
val removeTrailingWhitespacesUdf = functions.udf(removeTrailingWhitespaces)
val newDF = rawDF
.withColumn("e_no", removeTrailingWhitespacesUdf(rawDF("e_no")))
newDF.show()
I've run some tests for both options:
input:
H ,173
M,161
Output
+----+------+
| H| 173|
| M| 161|
+----+------+
Upvotes: 1
Reputation: 1186
The rtrim function should work. As for the regex replace, the correct regular expression will be
"\s+$"
Working code using spark: 2.2.1
import spark.implicits._
import org.apache.spark.sql.functions._
val list = Seq("525071 ", "512938", "522783 ")
val df = list.toDF("e_no")
df.select(
col("e_no"),
regexp_replace(col("e_no"), "\\s+$", ""),
rtrim(col("e_no")),
length(col("e_no"))
).show()
Upvotes: 3