Spark dataframe explode column

Question

Every row in the dataframe contains a csv formatted string line plus another simple string, so what I'm trying to get at the end is a dataframe composed of the fields extracted from the line string together with category. So I proceeded as follows to explode the line string

val df = stream.toDF("line","category")
.map(x => x.getString(0))......

At the end I manage to get a new dataframe composed of the line fields but I can't return the category to the new dataframe I can't join the new dataframe with the initial one since the common field id was not a separate column at first.

Sample of input :

line                           | category 
"'1';'daniel';'dan@gmail.com'" | "premium"

Sample of output:

id  | name    | email          | category
1   | "daniel"| "dan@gmail.com"| "premium"

Any suggestions, thanks in advance.

Ramesh Maharjan · Accepted Answer

If the structure of strings in line column is fixed as mentioned in the question, then following simple solution should work where split inbuilt function is used to split the string into array and then finally selecting the elements from the array and aliasing to get the final dataframe

import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)

which should give you

+---+--------+---------------+--------+
|id |name    |email          |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'dan@gmail.com'|premium |
+---+--------+---------------+--------+

I hope the answer is helpful

Spark dataframe explode column

Answers (1)

Related Questions