Etisha
Etisha

Reputation: 317

substring from lastIndexOf in spark scala

I have a column in my dataframe which contains the filename

test_1_1_1_202012010101101

I want to get the string after the lastIndexOf(_) I tried this and it is working

val timestamp_df =file_name_df.withColumn("timestamp",split(col("filename"),"_").getItem(4))

But I want to make it more generic, so that if in future if the filename can have any number of _ in it, it can split it on the basis of lastIndexOf _

val timestamp_df =file_name_df.withColumn("timestamp", expr("substring(filename, length(filename)-15,17)"))

This also is not generic as the character length can vary.

Can anyone help me in using the lastIndexOf function with withColumn.

Upvotes: 0

Views: 3420

Answers (3)

Etisha
Etisha

Reputation: 317

val timestamp_df =file_name_df.withColumn("timestamp",reverse(split(reverse(col("filename")),"_").getItem(0)))

It's working with this.

Upvotes: 0

notNull
notNull

Reputation: 31490

You can use element_at function with split to get last element of array.

Example:

df.withColumn("timestamp",element_at(split(col("filename"),"_"),-1)).show(false)
+--------------------------+---------------+
|filename                  |timestamp      |
+--------------------------+---------------+
|test_1_1_1_202012010101101|202012010101101|
+--------------------------+---------------+

Upvotes: 1

ELinda
ELinda

Reputation: 2821

You can use substring_index

scala> val df = Seq(("a-b-c", 1),("d-ef-foi",2)).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: string, c2: int]

+--------+---+
|      c1| c2|
+--------+---+
|   a-b-c|  1|
|d-ef-foi|  2|
+--------+---+

scala> df.withColumn("c3", substring_index(col("c1"), "-", -1)).show
+--------+---+---+
|      c1| c2| c3|
+--------+---+---+
|   a-b-c|  1|  c|
|d-ef-foi|  2|foi|
+--------+---+---+

Per docs: When the last argument "is negative, everything to the right of the final delimiter (counting from the right) is returned"

Upvotes: 0

Related Questions