Reputation:
I have a requirement to populate failed time stamp date. Because when the date value is not proper then unix_timestamp populates "null". In this case i want to populate actual failed date value in Timestamp instead of "null".
DataFrame:
id,date
1,161129
2,961209
3,110620
4,160928
5,021002
6,160421
7,160121
8,100903
9,077707
here id 9 has improper data which not in a dateformate. so if i use unix_timestamp("date",yymmdd).cast("timestamp")
it returns null
for row 9. but i want actuall date if fails. i.e "077707"
expected output
+---+-------------------+
| id| date|
+---+-------------------+
| 1|2016-11-29 00:00:00|
| 2|1996-12-09 00:00:00|
| 3|2011-06-20 00:00:00|
| 4|2016-09-28 00:00:00|
| 5|2002-10-02 00:00:00|
| 6|2016-04-21 00:00:00|
| 7|2016-01-21 00:00:00|
| 8|2010-09-03 00:00:00|
| 9|077707 |
+---+-------------------+
Upvotes: 1
Views: 6137
Reputation: 8513
You can use a conditional (when/otherwise) to select the raw or parsed date value. It would look something like this:
from pyspark.sql.functions import unix_timestamp, when, col
data = [
('1', '161129'),
('2', '961209'),
('3', '110620'),
('4', '160928'),
('5', '021002'),
('6', '160421'),
('7', '160121'),
('8', '100903'),
('9', '077707')
]
df = spark.createDataFrame(sc.parallelize(data), ['id', 'date'])
df = df.withColumn('parsed', unix_timestamp('date', 'yyMMdd').cast('timestamp'))
df = df.withColumn('combined', when(col('parsed').isNull(), col('date'))
.otherwise(col('parsed')))
df.show(100, False)
Upvotes: 3