Reputation: 73
Every row in the dataframe contains a csv formatted string line
plus another simple string, so what I'm trying to get at the end is a dataframe composed of the fields extracted from the line
string together with category
.
So I proceeded as follows to explode the line
string
val df = stream.toDF("line","category")
.map(x => x.getString(0))......
At the end I manage to get a new dataframe composed of the line fields but I can't return the category
to the new dataframe
I can't join the new dataframe with the initial one since the common field id
was not a separate column at first.
Sample of input :
line | category
"'1';'daniel';'[email protected]'" | "premium"
Sample of output:
id | name | email | category
1 | "daniel"| "[email protected]"| "premium"
Any suggestions, thanks in advance.
Upvotes: 1
Views: 503
Reputation: 41987
If the structure of strings in line
column is fixed as mentioned in the question, then following simple solution should work where split
inbuilt function is used to split the string into array and then finally selecting the elements from the array and aliasing to get the final dataframe
import org.apache.spark.sql.functions._
df.withColumn("line", split(col("line"), ";"))
.select(col("line")(0).as("id"), col("line")(1).as("name"), col("line")(2).as("email"), col("category"))
.show(false)
which should give you
+---+--------+---------------+--------+
|id |name |email |category|
+---+--------+---------------+--------+
|'1'|'daniel'|'[email protected]'|premium |
+---+--------+---------------+--------+
I hope the answer is helpful
Upvotes: 3