techknowfile
techknowfile

Reputation: 343

Creating a new dataframe with many rows for each row in existing dataframe

I currently have a dataframe

df1 =
+-----+
|  val|
+-----+
|    1|
|    2|
|    3|
  ....
| 2456|
+-----+

Each value corresponds to a single cell in a 3d cube. I have a function findNeighbors which returns a list of the neighboring cubes, which I then map to df1 to get the neighbors of every row.

df2 = df1.map(row => findNeighbors(row(0).toInt)

This results in something like

df2 =
+---------------+
|      neighbors|
+---------------+
|  (1,2), (1, 7)|
|  (2,1), (2, 3)|
  .... etc
+---------------+

Where, for each row, for each Array in that row, the first item is the value of the cell and the second is the value of its neighbor.

I now want to create a new dataframe that takes all of those nested arrays and makes them rows like this:

finalDF = 
    +-----+------+
    | cell|neighb|
    +-----+------+
    |    1|     2|
    |    1|     7|
    |    2|     1|
    |    2|     3|
      .... etc 
    +------------+

And this is where I am stuck

I tried using the code below, but I can't append to a local dataframe from within the foreach function.

var df: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], my_schema)
val colNames = Seq("cell", "neighb")
neighborsDf.foreach(row => {
      var rowDf: DataFrame = row.toDF(colNames: _*)
      df.union(rowDf)
    })

I'm sure there is a much better way to approach this problem, but I'm very new and very lost in scala/spark, and 10 hours of googling hasn't helped me.

Upvotes: 1

Views: 172

Answers (1)

Ged
Ged

Reputation: 18013

Starting a little down the track, a somewhat similar example:

val df2 = df.select(explode($"neighbours").as("neighbours_flat"))

val df3 = df2.select(col("neighbours_flat").getItem(0) as "cell",col("neighbours_flat")
            .getItem(1) as "neighbour")
df3.show(false)

starting from neighbours field def:

+----------------+
|neighbours_flat |
+----------------+
|[[1, 2], [1, 7]]|
|[[2, 1], [2, 3]]|
+----------------+

results in:

+----+---------+
|cell|neighbour|
+----+---------+
|1   |2        |
|1   |7        |
|2   |1        |
|2   |3        |
+----+---------+

You need to have an array def and then use explode.

Upvotes: 1

Related Questions