Apply Substring operation in Dataframe to create new column

Question

I have this dataframe

I want to perform a substring operation based on positions of letters so that the output will be like this

For creating new columns i hv to use substring operation

I have written a little spark scala code like below

 val df=data.toDF("HI")
   df.show()
   val df1=df.select(col("*"), substring(col("HI"), 0, 2).as("c"))
  val df2=df1.select(col("*"), substring(col("HI"), 3, 4).as("d"))
  val df3=df2.select(col("*"), substring(col("d"), 3, 4).as("e")).show()

now the problem I am facing

1>I am not able to get the actual result

2>Even if I could manage to get the result,This approach is wrong because of creating so many extra dataframes just to get the actual result

Is there any way I can perform multiple substring operation on a single dataframe and optimize the code a little bit??

werner · Accepted Answer

The substring function and withColumn should do it:

import org.apache.spark.sql.functions._
val df = data.toDF("HI")
df.withColumn("c", substring(col("HI"), 0, 2))
  .withColumn("d", substring(col("HI"), 3, 2))
  .withColumn("e", substring(col("HI"), 5, 2))
.show()

prints

+------+---+---+---+
|    HI|  c|  d|  e|
+------+---+---+---+
|abcdef| ab| cd| ef|
|ghijkl| gh| ij| kl|
|mnopqr| mn| op| qr|
|stuvwx| st| uv| wx|
|    yz| yz|   |   |
+------+---+---+---+

Looking at the output of Dataset.explain, the three string operations are executed together in a single stage, so there should be no performance issues:

== Physical Plan ==
*(1) Project [HI#10, substring(HI#10, 0, 2) AS c#12, substring(HI#10, 3, 2) AS d#15, substring(HI#10, 5, 2) AS e#19]
+- *(1) FileScan csv [HI#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/.../test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct

(I am reading the input data from a csv file)

Apply Substring operation in Dataframe to create new column

Answers (2)

Related Questions