Reputation: 343
I currently have a dataframe
df1 =
+-----+
| val|
+-----+
| 1|
| 2|
| 3|
....
| 2456|
+-----+
Each value corresponds to a single cell in a 3D cube. I have a function findNeighbors that returns a list of the neighboring cells, which I then map over df1 to get the neighbors of every row.
val df2 = df1.map(row => findNeighbors(row.getInt(0)))
This results in something like
df2 =
+---------------+
| neighbors|
+---------------+
| (1, 2), (1, 7)|
| (2, 1), (2, 3)|
.... etc
+---------------+
where, in each pair within a row, the first item is the value of the cell and the second is the value of one of its neighbors.
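For context, findNeighbors has roughly this shape; the body below is a simplified stand-in for illustration, not my real 3D lookup:
// Stand-in only: the real implementation walks the 3D cube.
// Each returned pair is (cellValue, neighbourValue).
def findNeighbors(cell: Int): Array[(Int, Int)] = {
  Array((cell, cell + 1), (cell, cell + 6)) // placeholder neighbour logic
}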
I now want to create a new dataframe that takes all of those nested arrays and makes them rows like this:
finalDF =
+-----+------+
| cell|neighb|
+-----+------+
| 1| 2|
| 1| 7|
| 2| 1|
| 2| 3|
.... etc
+-----+------+
And this is where I am stuck.
I tried using the code below, but I can't append to a local dataframe from within the foreach function.
var df: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)
val colNames = Seq("cell", "neighb")
neighborsDf.foreach(row => {
  var rowDf: DataFrame = row.toDF(colNames: _*)
  df.union(rowDf)
})
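The only version I can get to run collects everything back to the driver first, which I assume defeats the point of Spark for a large cube (this assumes neighborsDf, i.e. df2 above, ends up as a Dataset[Array[(Int, Int)]]):
import spark.implicits._

// Pulls the whole Dataset onto the driver, so this presumably won't scale
val pairs = neighborsDf.collect().flatten
val finalDF = pairs.toSeq.toDF("cell", "neighb")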
I'm sure there is a much better way to approach this problem, but I'm very new and very lost in scala/spark, and 10 hours of googling hasn't helped me.
Upvotes: 1
Views: 172
Reputation: 18013
Starting a little way down the track, here is a somewhat similar example:
import org.apache.spark.sql.functions._
import spark.implicits._

val df2 = df.select(explode($"neighbours").as("neighbours_flat"))
val df3 = df2.select(
  col("neighbours_flat").getItem(0) as "cell",
  col("neighbours_flat").getItem(1) as "neighbour")
df3.show(false)
Starting from this neighbours field definition:
+----------------+
|neighbours      |
+----------------+
|[[1, 2], [1, 7]]|
|[[2, 1], [2, 3]]|
+----------------+
results in:
+----+---------+
|cell|neighbour|
+----+---------+
|1 |2 |
|1 |7 |
|2 |1 |
|2 |3 |
+----+---------+
You need to define neighbours as an array column and then use explode to turn each nested pair into its own row.
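For completeness, here is a self-contained sketch of the whole pipeline (Spark 2.x assumed; the input is mocked to match your example):
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("neighbours").getOrCreate()
import spark.implicits._

// Mock input: one row per cell, each holding an array of [cell, neighbour] pairs
val df = Seq(
  Tuple1(Seq(Seq(1, 2), Seq(1, 7))),
  Tuple1(Seq(Seq(2, 1), Seq(2, 3)))
).toDF("neighbours")

// explode turns each [cell, neighbour] pair into its own row,
// then getItem pulls the two elements out into separate columns
val df2 = df.select(explode($"neighbours").as("neighbours_flat"))
val df3 = df2.select(
  col("neighbours_flat").getItem(0) as "cell",
  col("neighbours_flat").getItem(1) as "neighbour")
df3.show(false)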
Upvotes: 1