deals my

Reputation: 133

Spark Java - Adding new column based on date in Oracle dataset

I am trying to write a Spark Java program that adds a column to a dataset based on a date. I am using an Oracle database.

I need to add a new column (yearquarter) based on posteddate using Spark Java. For example, if posteddate falls between Jan 1st and March 31st of 2020, the yearquarter value would be Q12020. You can see the current dataset and the expected dataset below.

Can someone show me the piece of Spark Java code that adds the new column?

Code snippet for reading from the table (inputDataset):

Dataset<Row> inputDataset = sparksession.read().jdbc(jdbcUrl, table_name, connectionProperties);
    inputDataset.show(); 

DataSets

  Current Dataset (inputDataset):
                   +------+--------+---------------------+
                   | ID   |location| posteddate          |
                   +------+--------+---------------------+
                   |137570|chennai |2020-06-22 13:49:... |
                   |137571| kerala |2020-02-22 14:49:... |
                   |137572|chennai |2018-10-26 13:19:... |
                   |137573|chennai |2019-09-29 14:49:... |
                   +------+--------+---------------------+
           
           
           Expected Dataset:
                   +------+--------+---------------------+--------------+
                   |   id |location| posteddate          |  yearquarter |
                   +------+--------+---------------------+--------------+
                   |137570|chennai |2020-06-22 13:49:... |        Q22020|
                   |137571| kerala |2020-02-22 14:49:... |        Q12020|
                   |137572|chennai |2018-10-26 13:19:... |        Q42018|
                   |137573|chennai |2019-09-29 14:49:... |        Q32019|
                   +------+--------+---------------------+--------------+ 
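The mapping the question asks for (quarter number first, then year, e.g. 2020-06-22 → Q22020) can be sketched in plain Java with java.time, independent of Spark. This is only an illustration of the labelling rule; `quarterLabel` is a hypothetical helper name, not part of the question's code:

```java
import java.time.LocalDateTime;

public class YearQuarter {
    // Hypothetical helper: maps a timestamp to its "Q<quarter><year>" label,
    // e.g. 2020-06-22 13:49 -> "Q22020".
    static String quarterLabel(LocalDateTime posted) {
        // Months 1-3 -> quarter 1, 4-6 -> quarter 2, and so on.
        int quarter = (posted.getMonthValue() - 1) / 3 + 1;
        return "Q" + quarter + posted.getYear();
    }

    public static void main(String[] args) {
        System.out.println(quarterLabel(LocalDateTime.of(2020, 6, 22, 13, 49))); // Q22020
        System.out.println(quarterLabel(LocalDateTime.of(2018, 10, 26, 13, 19))); // Q42018
    }
}
```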

Thanks in advance

Upvotes: 1

Views: 189

Answers (2)

Som

Reputation: 6338

Try this - use quarter() + year():

        import static org.apache.spark.sql.functions.expr;

        dataset.show(false);
        dataset.printSchema();
        /**
         * +------+--------+-------------------+
         * |ID    |location|posteddate         |
         * +------+--------+-------------------+
         * |137570|chennai |2020-06-22 13:49:00|
         * |137571|kerala  |2020-02-22 14:49:00|
         * |137572|chennai |2018-10-26 13:19:00|
         * |137573|chennai |2019-09-29 14:49:00|
         * +------+--------+-------------------+
         *
         * root
         *  |-- ID: integer (nullable = true)
         *  |-- location: string (nullable = true)
         *  |-- posteddate: timestamp (nullable = true)
         */

        dataset.withColumn("yearquarter", expr("concat('Q', quarter(posteddate), year(posteddate))"))
                .show(false);
        /**
         * +------+--------+-------------------+-----------+
         * |ID    |location|posteddate         |yearquarter|
         * +------+--------+-------------------+-----------+
         * |137570|chennai |2020-06-22 13:49:00|Q22020     |
         * |137571|kerala  |2020-02-22 14:49:00|Q12020     |
         * |137572|chennai |2018-10-26 13:19:00|Q42018     |
         * |137573|chennai |2019-09-29 14:49:00|Q32019     |
         * +------+--------+-------------------+-----------+
         */

Upvotes: 1

fg78nc

Reputation: 5232

Imports:

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

You can use this as a starting point (replace "start_range" and "end_range" with your own quarter boundaries):

inputDataset = inputDataset
    .withColumn("yearquarter",                          // add the new column
        when(                                           // conditional operator
            col("posteddate")
                .$greater("start_range")                // condition #1: after quarter start
            .and(                                       // and
                col("posteddate").$less("end_range")),  // condition #2: before quarter end
            lit("Q12020"))          // column value if the condition evaluates to true
        .otherwise(lit("Q22020"))); // column value if the condition evaluates to false
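The start_range / end_range placeholders above stand for the boundaries of a quarter. As a reference for filling them in, the boundaries of any quarter can be computed with plain java.time (no Spark); `quarterStart` and `quarterEnd` are hypothetical helper names introduced here for illustration:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;

public class QuarterRange {
    // Hypothetical helper: inclusive start of the given quarter,
    // e.g. quarterStart(2020, 1) -> 2020-01-01T00:00.
    static LocalDateTime quarterStart(int year, int quarter) {
        return LocalDate.of(year, (quarter - 1) * 3 + 1, 1).atStartOfDay();
    }

    // Hypothetical helper: exclusive end of the given quarter,
    // i.e. the start of the next quarter, three months later.
    static LocalDateTime quarterEnd(int year, int quarter) {
        return quarterStart(year, quarter).plusMonths(3);
    }

    public static void main(String[] args) {
        System.out.println(quarterStart(2020, 1)); // 2020-01-01T00:00
        System.out.println(quarterEnd(2020, 1));   // 2020-04-01T00:00
    }
}
```

Note that hard-coding one when/otherwise pair only distinguishes two labels; to cover every quarter in the data, the quarter()/year() approach in the other answer generalizes without enumerating ranges.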

Upvotes: 1
