Reputation: 15
i have a tuple of 2 elements like this :
Tuple2(“String1, String2”, ArrayList(“String3”, “String4”))
=> 1st element is a string with comma separated string values
=> 2nd element is an arraylist that contains a list of string
i would like to have a data frame like this :
Col1 Col2 Col3
1 String1 String3
2 String1 String4
3 String2 String3
4 String2 String4
Upvotes: 1
Views: 654
Reputation: 1529
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
// `split` "String1, String2" into separate values, then create a row per value using `explode`
.withColumn("Col2", explode(split(col("_1"), ", ")))
// create a row per value in the list: "String3", "String4"
.withColumn("Col3", explode(col("_2")))
// now that we have our 4 rows, add a new column with an incrementing number
.withColumn("Col1", monotonically_increasing_id() + 1)
// only keep the columns we care about
.select("Col1", "Col2", "Col3")
.show(false)
Starting with your example:
val tuple2 = Tuple2("String1, String2", List("String3", "String4"))
and turning it into a DataFrame:
val df = List(tuple2).toDF("_1", "_2")
df.show(false)
which gives:
+----------------+------------------+
|_1 |_2 |
+----------------+------------------+
|String1, String2|[String3, String4]|
+----------------+------------------+
Now we are ready for the transformation:
import org.apache.spark.sql.functions.{col, explode, monotonically_increasing_id, split}
df
// `split` "String1, String2" into separate values, then create a row per value using `explode`
.withColumn("Col2", explode(split(col("_1"), ", ")))
// create a row per value in the list: "String3", "String4"
.withColumn("Col3", explode(col("_2")))
// now that we have our 4 rows, add a new column with an incrementing number
.withColumn("Col1", monotonically_increasing_id() + 1)
// only keep the columns we care about
.select("Col1", "Col2", "Col3")
.show(false)
which gives:
+----+-------+-------+
|Col1|Col2 |Col3 |
+----+-------+-------+
|1 |String1|String3|
|2 |String1|String4|
|3 |String2|String3|
|4 |String2|String4|
+----+-------+-------+
It is worth noting that the order of the operations is key:
"String1"
and "String2"
into their own rows:df
.withColumn("Col2", explode(split(col("_1"), ", ")))
.select("Col2")
.show(false)
gives:
+-------+
|Col2 |
+-------+
|String1|
|String2|
+-------+
where we go from 1 original row, to two.
"String3", "String4"
:df
.withColumn("Col2", explode(split(col("_1"), ", ")))
.withColumn("Col3", explode(col("_2")))
.select("Col2", "Col3")
.show(false)
gives:
+-------+-------+
|Col2 |Col3 |
+-------+-------+
|String1|String3|
|String1|String4|
|String2|String3|
|String2|String4|
+-------+-------+
for example:
df
// here we add `Col1` to a Dataset of only one row! So we only have the value `1`
.withColumn("Col1", monotonically_increasing_id() + 1)
// here we explode row 1, copying the value of `Col1`
.withColumn("Col2", explode(split(col("_1"), ", ")))
.withColumn("Col3", explode(col("_2")))
.select("Col1", "Col2", "Col3")
.show(false)
gives:
+----+-------+-------+
|Col1|Col2 |Col3 |
+----+-------+-------+
|1 |String1|String3|
|1 |String1|String4|
|1 |String2|String3|
|1 |String2|String4|
+----+-------+-------+
Upvotes: 1