Reputation: 133
I am trying to write a Spark Java program that adds a column to a dataset based on a date. I am using an Oracle database.
I need to add a new column (yearquarter) based on posteddate using Spark Java. For example, if posteddate falls between January 1st and March 31st of 2020, the yearquarter value would be Q12020. You can see the current dataset and the expected dataset below.
Can someone show me the piece of Spark Java code that adds the new column?
Code snippet for reading from the table into inputDataset:
Dataset<Row> inputDataset = sparksession.read().jdbc(jdbcUrl, table_name, connectionProperties);
inputDataset.show();
Datasets
Current dataset (inputDataset):
+------+--------+---------------------+
| ID   |location| posteddate          |
+------+--------+---------------------+
|137570|chennai |2020-06-22 13:49:... |
|137571| kerala |2020-02-22 14:49:... |
|137572|chennai |2018-10-26 13:19:... |
|137573|chennai |2019-09-29 14:49:... |
+------+--------+---------------------+
Expected dataset:
+------+--------+---------------------+-----------+
| id   |location| posteddate          |yearquarter|
+------+--------+---------------------+-----------+
|137570|chennai |2020-06-22 13:49:... |Q22020     |
|137571| kerala |2020-02-22 14:49:... |Q12020     |
|137572|chennai |2018-10-26 13:19:... |Q42018     |
|137573|chennai |2019-09-29 14:49:... |Q32019     |
+------+--------+---------------------+-----------+
Thanks in advance
Upvotes: 1
Views: 189
Reputation: 6338
Try this - use the built-in quarter and year functions:
import static org.apache.spark.sql.functions.expr;

dataset.show(false);
dataset.printSchema();
/**
* +------+--------+-------------------+
* |ID |location|posteddate |
* +------+--------+-------------------+
* |137570|chennai |2020-06-22 13:49:00|
* |137571|kerala |2020-02-22 14:49:00|
* |137572|chennai |2018-10-26 13:19:00|
* |137573|chennai |2019-09-29 14:49:00|
* +------+--------+-------------------+
*
* root
* |-- ID: integer (nullable = true)
* |-- location: string (nullable = true)
* |-- posteddate: timestamp (nullable = true)
*/
dataset.withColumn("yearquarter", expr("concat('Q', quarter(posteddate), year(posteddate))"))
.show(false);
/**
* +------+--------+-------------------+-----------+
* |ID |location|posteddate |yearquarter|
* +------+--------+-------------------+-----------+
* |137570|chennai |2020-06-22 13:49:00|Q22020 |
* |137571|kerala |2020-02-22 14:49:00|Q12020 |
* |137572|chennai |2018-10-26 13:19:00|Q42018 |
* |137573|chennai |2019-09-29 14:49:00|Q32019 |
* +------+--------+-------------------+-----------+
*/
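For completeness, a minimal end-to-end sketch combining the OP's JDBC read with this expression (the Oracle URL, table name, and credentials below are placeholders, not the OP's actual values):

import static org.apache.spark.sql.functions.expr;

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class YearQuarterExample {
    public static void main(String[] args) {
        SparkSession sparksession = SparkSession.builder()
                .appName("yearquarter-example")
                .getOrCreate();

        // placeholder Oracle connection details - replace with your own
        String jdbcUrl = "jdbc:oracle:thin:@//host:1521/service";
        String tableName = "MY_TABLE";
        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "username");
        connectionProperties.put("password", "password");

        // read the source table, as in the question
        Dataset<Row> inputDataset = sparksession.read()
                .jdbc(jdbcUrl, tableName, connectionProperties);

        // derive Q<quarter><year>, e.g. Q22020, from the posteddate timestamp
        Dataset<Row> result = inputDataset.withColumn("yearquarter",
                expr("concat('Q', quarter(posteddate), year(posteddate))"));

        result.show(false);
    }
}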
Upvotes: 1
Reputation: 5232
Imports:
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;
You can use this as a starting point:
inputDataset = inputDataset
.withColumn( "yearquarter", // adding column
when( // conditional operator
col("posteddate")
.$greater("start_range") // condition #1
.and( // and
col("posteddate").$less("end_range")), // condition #2
lit("Q12020")) // column value if condition evaluates to true
.otherwise(lit("Q22020"))); // column value if condition evaluates to false
Upvotes: 1