sghayth
sghayth

Reputation: 73

Spark dataframe explode column

Every row in the dataframe contains a csv formatted string line plus another simple string, so what I'm trying to get at the end is a dataframe composed of the fields extracted from the line string together with category. So I proceeded as follows to explode the line string

val df = stream.toDF("line","category")
.map(x => x.getString(0))......

At the end I manage to get a new dataframe composed of the line fields but I can't return the category to the new dataframe I can't join the new dataframe with the initial one since the common field id was not a separate column at first.

Sample of input :

line                           | category 
"'1';'daniel';'[email protected]'" | "premium"

Sample of output:

id  | name    | email          | category
1   | "daniel"| "[email protected]"| "premium"

Any suggestions, thanks in advance.

Upvotes: 1

Views: 503

Answers (1)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41987

If the structure of strings in line column is fixed as mentioned in the question, then following simple solution should work where split inbuilt function is used to split the string into array and then finally selecting the elements from the array and aliasing to get the final dataframe

import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
  .select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
  .show(false)

which should give you

+---+--------+---------------+--------+
|id |name    |email          |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'[email protected]'|premium |
+---+--------+---------------+--------+

I hope the answer is helpful

Upvotes: 3

Related Questions