Reputation: 45
I have a datetime column declared as decimal(38,0) (not timestamp or date) whose values are stored as 'yyyyMMdd'. How do I convert that column to a 'yyyy-MM-dd' date in Spark SQL (Scala) and select only rows that are one or two days old?
I have tried:
select count(*) from table_name where to_date('column_name','yyyy-MM-dd') = date_sub(current_date(),1);
This gives me a count of 0, even though the table has well over 500,000 records.
I tried:
select count(*) from table_name where from_unixtime(cast(load_dt_id as string), 'yyyy-MM-dd') = date_sub(current_date(), 1);
This returned dates in 1970 (e.g. 1970-01-31), even though no data from that year exists in the table; when I filter that column with LIKE '1970%', the query comes back "OK" with the bolt icon indicating it was accelerated by Delta. When the data is ordered by that column, the values start at 20140320.
Upvotes: 2
Views: 1108
Reputation: 8758
The format argument for to_date is the format of the input, not the desired output. (Your from_unixtime attempt treats the number as Unix epoch seconds, which is why you get dates in 1970.) Assuming your raw values look like yyyyMMdd:
Seq(("20200208")).toDF("RawDate").select(col("RawDate"),to_date(col("RawDate"),"yyyyMMdd").as("formatted_date")).show()
+--------+--------------+
| RawDate|formatted_date|
+--------+--------------+
|20200208| 2020-02-08|
+--------+--------------+
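Since the column in your table is a decimal(38,0) rather than a string, cast it to string before parsing. A minimal sketch, assuming the table and column names table_name and load_dt_id from your queries:

import org.apache.spark.sql.functions._

// load_dt_id is a decimal(38,0) holding values like 20140320; cast it to string
// so to_date can parse it with the input format yyyyMMdd.
val withDate = spark.table("table_name")
  .withColumn("formatted_date", to_date(col("load_dt_id").cast("string"), "yyyyMMdd"))

withDate.select("load_dt_id", "formatted_date").show(5)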
Expanding this to filter by the derived date column:
val raw = Seq(("20200208"),("20200209"),("20200210")).toDF("RawDate")
raw: org.apache.spark.sql.DataFrame = [RawDate: string]
raw.select(col("RawDate"), to_date(col("RawDate"), "yyyyMMdd").as("formatted_date"))
  .filter($"formatted_date".geq(date_add(current_date, -1)))
  .show
+--------+--------------+
| RawDate|formatted_date|
+--------+--------------+
|20200209| 2020-02-09|
|20200210| 2020-02-10|
+--------+--------------+
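Since you asked about Spark SQL specifically, here is a minimal sketch of the same filter as a query, again assuming the names table_name and load_dt_id from your attempts:

// Cast the decimal to string, parse it with the input format yyyyMMdd,
// then keep rows whose date is yesterday or later.
spark.sql("""
  SELECT count(*)
  FROM table_name
  WHERE to_date(CAST(load_dt_id AS STRING), 'yyyyMMdd') >= date_sub(current_date(), 1)
""").show()

Use = instead of >= if you only want rows from exactly yesterday.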
Upvotes: 2