rwxrwxr--
rwxrwxr--

Reputation: 45

Spark/scala how to subtract the current column's values from the previous column's?

I have a data frame that looks like this:

+--------------+-------+-------+-------+-------+-------+-------+-------+
|Country/Region| 3/7/20| 3/8/20| 3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
+--------------+-------+-------+-------+-------+-------+-------+-------+
|       Senegal|      0|      4|     10|     18|     27|     31|     35|
+--------------+-------+-------+-------+-------+-------+-------+-------+
|       Tunisia|      1|      8|     15|     21|     37|     42|     59|
+--------------+-------+-------+-------+-------+-------+-------+-------+

For each country, I have a unique row, but I have many columns representing days. I would like to go through each column and subtract from it the corresponding value in the previous column such as the resulting df should be as follows:

+--------------+-------+-------+-------+-------+-------+-------+-------+
|Country/Region| 3/7/20| 3/8/20| 3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
+--------------+-------+-------+-------+-------+-------+-------+-------+
|       Senegal|      0|      4|      6|      8|      9|      4|      4|
+--------------+-------+-------+-------+-------+-------+-------+-------+
|       Tunisia|      1|      7|      7|      6|     16|      5|     17|
+--------------+-------+-------+-------+-------+-------+-------+-------+

Upvotes: 1

Views: 538

Answers (1)

Som
Som

Reputation: 6338

Perhaps this is helpful-

df2.show(false)
    df2.printSchema()
    /**
      * +--------------+------+------+------+-------+-------+-------+-------+
      * |Country/Region|3/7/20|3/8/20|3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
      * +--------------+------+------+------+-------+-------+-------+-------+
      * |Senegal       |0     |4     |10    |18     |27     |31     |35     |
      * |Tunisia       |1     |8     |15    |21     |37     |42     |59     |
      * +--------------+------+------+------+-------+-------+-------+-------+
      *
      * root
      * |-- Country/Region: string (nullable = true)
      * |-- 3/7/20: integer (nullable = true)
      * |-- 3/8/20: integer (nullable = true)
      * |-- 3/9/20: integer (nullable = true)
      * |-- 3/10/20: integer (nullable = true)
      * |-- 3/11/20: integer (nullable = true)
      * |-- 3/12/20: integer (nullable = true)
      * |-- 3/13/20: integer (nullable = true)
      */

    val new_df = df2.withColumn("01/01/70", lit(0))
    val tuples = new_df.schema.filter(_.dataType.isInstanceOf[NumericType])
      .map(_.name)
      .map(c => {
      val sdf = new SimpleDateFormat("MM/dd/yy")
      (sdf.parse(c), c)
    }).sortBy(_._1)
      .map(_._2)
      .sliding(2, 1)
      .map(seq => (col(seq.last) - col(seq.head)).as(seq.last))

    new_df.select(col("Country/Region") +: tuples.toSeq: _* )
      .show(false)

    /**
      * +--------------+------+------+------+-------+-------+-------+-------+
      * |Country/Region|3/7/20|3/8/20|3/9/20|3/10/20|3/11/20|3/12/20|3/13/20|
      * +--------------+------+------+------+-------+-------+-------+-------+
      * |Senegal       |0     |4     |6     |8      |9      |4      |4      |
      * |Tunisia       |1     |7     |7     |6      |16     |5      |17     |
      * +--------------+------+------+------+-------+-------+-------+-------+
      */

Upvotes: 1

Related Questions