How to compute the numerical difference between columns of different dataframes?

Question

Given two spark dataframes A and B with the same number of columns and rows, I want to compute the numerical difference between the two dataframes and store it into another dataframe (or another data structure optionally).

For instance let us have the following datasets

DataFrame A:

+----+---+
|  A | B |
+----+---+
|   1|  0|
|   1|  0|
+----+---+

DataFrame B:

----+---+
|  A | B |
+----+---+
|   1| 0 |
|   0| 0 |
+----+---+

How to obtain B-A, i.e


+----+---+
| c1 | c2|
+----+---+
|   0| 0 |
|  -1| 0 |
+----+---+

In practice the real dataframes have a consequent number of rows and 50+ columns for which the difference need to be computed. What is the Spark/Scala way of doing it?

pratiksadaphal · Accepted Answer

I was able to solve this by using the approach below. This code can work with any number of columns. You just have to change the input DFs accordingly.

import org.apache.spark.sql.Row

val df0 = Seq((1, 5), (1, 4)).toDF("a", "b")
val df1 = Seq((1, 0), (3, 2)).toDF("a", "b")

val columns = df0.columns
    val rdd = df0.rdd.zip(df1.rdd).map {
      x =>
        val arr = columns.map(column =>
          x._2.getAs[Int](column) - x._1.getAs[Int](column))
        Row(arr: _*)
    }

spark.createDataFrame(rdd, df0.schema).show(false)

Output generated:

df0=>
+---+---+
|a  |b  |
+---+---+
|1  |5  |
|1  |4  |
+---+---+
df1=>
+---+---+
|a  |b  |
+---+---+
|1  |0  |
|3  |2  |
+---+---+
Output=>
+---+---+
|a  |b  |
+---+---+
|0  |-5 |
|2  |-2 |
+---+---+

How to compute the numerical difference between columns of different dataframes?

Answers (2)

Related Questions