Reputation: 163
I have this dataframe
I want to perform a substring operation based on positions of letters so that the output will be like this
For creating new columns i hv to use substring operation
I have written a little spark scala code like below
val df=data.toDF("HI")
df.show()
val df1=df.select(col("*"), substring(col("HI"), 0, 2).as("c"))
val df2=df1.select(col("*"), substring(col("HI"), 3, 4).as("d"))
val df3=df2.select(col("*"), substring(col("d"), 3, 4).as("e")).show()
now the problem I am facing
1>I am not able to get the actual result
2>Even if I could manage to get the result,This approach is wrong because of creating so many extra dataframes just to get the actual result
Is there any way I can perform multiple substring operation on a single dataframe and optimize the code a little bit??
Upvotes: 1
Views: 2767
Reputation: 14905
The substring function and withColumn should do it:
import org.apache.spark.sql.functions._
val df = data.toDF("HI")
df.withColumn("c", substring(col("HI"), 0, 2))
.withColumn("d", substring(col("HI"), 3, 2))
.withColumn("e", substring(col("HI"), 5, 2))
.show()
prints
+------+---+---+---+
| HI| c| d| e|
+------+---+---+---+
|abcdef| ab| cd| ef|
|ghijkl| gh| ij| kl|
|mnopqr| mn| op| qr|
|stuvwx| st| uv| wx|
| yz| yz| | |
+------+---+---+---+
Looking at the output of Dataset.explain, the three string operations are executed together in a single stage, so there should be no performance issues:
== Physical Plan ==
*(1) Project [HI#10, substring(HI#10, 0, 2) AS c#12, substring(HI#10, 3, 2) AS d#15, substring(HI#10, 5, 2) AS e#19]
+- *(1) FileScan csv [HI#10] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/home/.../test.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<HI:string>
(I am reading the input data from a csv file)
Upvotes: 3
Reputation: 5487
Hope this will help,
val df= Seq("abcdef", "ghijkl", "mnopqr", "stuvwx", "yz").toDF("mainCol")
df.select(Array(col("mainCol")) ++ (0 to 4 by 2).map(i => concat(split('mainCol, "").getItem(i), split('mainCol, "").getItem(i + 1)).as(s"subColumn$i")): _*).show()
Upvotes: 0