Hannah

Reputation: 163

Apply Substring operation in Dataframe to create new column

I have this dataframe

(image of the input dataframe)

I want to perform substring operations based on the positions of letters, so that the output looks like this.

To create the new columns I have to use the substring operation.

(image of the expected output)

I have written a little Spark Scala code like below:

val df = data.toDF("HI")
df.show()
val df1 = df.select(col("*"), substring(col("HI"), 0, 2).as("c"))
val df2 = df1.select(col("*"), substring(col("HI"), 3, 4).as("d"))
val df3 = df2.select(col("*"), substring(col("d"), 3, 4).as("e")).show()

Now, the problems I am facing:

1. I am not able to get the actual result.

2. Even if I could manage to get the result, this approach is wrong because it creates so many extra dataframes just to get the actual result.

Is there any way I can perform multiple substring operations on a single dataframe and optimize the code a little bit?

Upvotes: 1

Views: 2767

Answers (2)

werner

Reputation: 14905

The substring function and withColumn should do it:

import org.apache.spark.sql.functions._

val df = data.toDF("HI")
df.withColumn("c", substring(col("HI"), 0, 2))
  .withColumn("d", substring(col("HI"), 3, 2))
  .withColumn("e", substring(col("HI"), 5, 2))
  .show()

prints

+------+---+---+---+
|    HI|  c|  d|  e|
+------+---+---+---+
|abcdef| ab| cd| ef|
|ghijkl| gh| ij| kl|
|mnopqr| mn| op| qr|
|stuvwx| st| uv| wx|
|    yz| yz|   |   |
+------+---+---+---+

Looking at the output of Dataset.explain, the three string operations are executed together in a single stage, so there should be no performance issues:

== Physical Plan ==
*(1) Project [HI#10, substring(HI#10, 0, 2) AS c#12, substring(HI#10, 3, 2) AS d#15, substring(HI#10, 5, 2) AS e#19]
+- *(1) FileScan csv [HI#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/.../test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<HI:string>

(I am reading the input data from a CSV file.)
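A minimal sketch of how that setup might look (the file path, the header option, and the `spark` session name are assumptions, not taken from the answer):

import org.apache.spark.sql.functions._

// Hypothetical CSV path; the plan above only shows a truncated path.
val data = spark.read
  .option("header", "true")   // assuming the file has a header row naming the column HI
  .csv("/home/user/test.csv")

val result = data
  .withColumn("c", substring(col("HI"), 0, 2))
  .withColumn("d", substring(col("HI"), 3, 2))
  .withColumn("e", substring(col("HI"), 5, 2))

result.explain()   // prints a physical plan like the one above
result.show()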

Upvotes: 3

Mohana B C

Reputation: 5487

Hope this will help:

val df = Seq("abcdef", "ghijkl", "mnopqr", "stuvwx", "yz").toDF("mainCol")
// Split mainCol into single characters, then concat adjacent pairs (0-1, 2-3, 4-5) as subColumn0/2/4.
df.select(Array(col("mainCol")) ++ (0 to 4 by 2).map(i =>
    concat(split('mainCol, "").getItem(i), split('mainCol, "").getItem(i + 1)).as(s"subColumn$i")): _*)
  .show()
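A note on this approach (my observation, not part of the answer): split('mainCol, "") turns the string into an array of single characters, and concat then glues adjacent pairs back together as subColumn0, subColumn2 and subColumn4. For strings shorter than six characters, such as "yz", getItem returns null for the missing positions and concat therefore yields null, whereas the substring-based answer above returns empty strings for those columns.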

Upvotes: 0
