Reputation: 1249
Does anyone know a good way in Scala to explode a row into multiple rows based on a range from two columns?
For example, for the input dataframe:
start_ip_int | end_ip_int | country | city
100          | 105        | USA     | Boston
The expected output dataframe is:
start_ip_int | end_ip_int | country | city   | ip
100          | 105        | USA     | Boston | 100
100          | 105        | USA     | Boston | 101
100          | 105        | USA     | Boston | 102
100          | 105        | USA     | Boston | 103
100          | 105        | USA     | Boston | 104
100          | 105        | USA     | Boston | 105
So here one row is split into six rows based on the range defined by the columns start_ip_int and end_ip_int.
Upvotes: 1
Views: 717
Reputation: 22439
If you're on Spark 2.4+, use sequence with the IP integer range as its arguments to generate an ArrayType column, then explode it:
val df = Seq((100, 105, "USA", "Boston")).
  toDF("start_ip_int", "end_ip_int", "country", "city")
df.withColumn("ip", explode(sequence($"start_ip_int", $"end_ip_int"))).show
// +------------+----------+-------+------+---+
// |start_ip_int|end_ip_int|country| city| ip|
// +------------+----------+-------+------+---+
// | 100| 105| USA|Boston|100|
// | 100| 105| USA|Boston|101|
// | 100| 105| USA|Boston|102|
// | 100| 105| USA|Boston|103|
// | 100| 105| USA|Boston|104|
// | 100| 105| USA|Boston|105|
// +------------+----------+-------+------+---+
For older Spark versions, consider creating a simple UDF that mimics the sequence function:
val rangeSequence = udf { (lower: Int, upper: Int) =>
  Seq.iterate(lower, upper - lower + 1)(_ + 1)
}
// Applying the UDF, followed by `explode`
df.withColumn("ip", explode(rangeSequence($"start_ip_int", $"end_ip_int")))
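As a side note, on Spark 2.4+ the same sequence function can also be invoked through a SQL expression with expr, which avoids needing the $ column syntax. Here is a minimal, self-contained sketch (the local SparkSession setup and the rebuilt df are just scaffolding to make it runnable on its own):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, expr}

val spark = SparkSession.builder.master("local[*]").appName("explode-range").getOrCreate()
import spark.implicits._

val df = Seq((100, 105, "USA", "Boston")).
  toDF("start_ip_int", "end_ip_int", "country", "city")

// sequence(start, stop) defaults to a step of 1; an optional third
// argument lets you change the step, e.g. sequence(start, stop, 2)
val exploded = df.withColumn("ip",
  explode(expr("sequence(start_ip_int, end_ip_int)")))
```

This produces the same six rows as the withColumn/sequence version above.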
Upvotes: 4