user8510536

How to populate the original date value instead of "null" when timestamp parsing fails in PySpark

I have a requirement to keep the original date value when timestamp parsing fails. When the date value is not valid, unix_timestamp returns "null". In that case I want to populate the actual failed date value in the timestamp column instead of "null".

DataFrame:

    id,date
    1,161129
    2,961209
    3,110620
    4,160928
    5,021002
    6,160421
    7,160121
    8,100903
    9,077707

Here id 9 has improper data that is not in the date format, so if I use unix_timestamp("date", "yyMMdd").cast("timestamp") it returns null for row 9, but I want the actual date value when parsing fails, i.e. "077707".
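
A minimal sketch of what I am trying now (same column names as above):

    from pyspark.sql.functions import unix_timestamp

    # parses the valid yyMMdd strings, but returns null for row 9 ('077707')
    df = df.withColumn('date', unix_timestamp('date', 'yyMMdd').cast('timestamp'))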

Expected output:

+---+-------------------+
| id|               date|           
+---+-------------------+
|  1|2016-11-29 00:00:00|
|  2|1996-12-09 00:00:00|
|  3|2011-06-20 00:00:00|
|  4|2016-09-28 00:00:00|
|  5|2002-10-02 00:00:00|
|  6|2016-04-21 00:00:00|
|  7|2016-01-21 00:00:00|
|  8|2010-09-03 00:00:00|
|  9|077707             |
+---+-------------------+

Upvotes: 1

Views: 6137

Answers (1)

Ryan Widmaier

Reputation: 8513

You can use a conditional (when/otherwise) to select the raw or parsed date value. It would look something like this:

from pyspark.sql.functions import unix_timestamp, when, col

data = [
  ('1', '161129'),
  ('2', '961209'),
  ('3', '110620'),
  ('4', '160928'),
  ('5', '021002'),
  ('6', '160421'),
  ('7', '160121'),
  ('8', '100903'),
  ('9', '077707')
]

# Build the sample DataFrame with the raw string dates
df = spark.createDataFrame(sc.parallelize(data), ['id', 'date'])

# Parse the yyMMdd strings into timestamps; invalid values come back as null
df = df.withColumn('parsed', unix_timestamp('date', 'yyMMdd').cast('timestamp'))
# Fall back to the raw string wherever parsing returned null
df = df.withColumn('combined', when(col('parsed').isNull(), col('date'))
                                 .otherwise(col('parsed')))
df.show(100, False)
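
Note that because the two branches of the when/otherwise mix a timestamp and a string, Spark resolves the combined column to a string type, which is what lets row 9 keep its raw value. With the sample data above the output should look roughly like this (the parsed column is kept here only for illustration):

+---+------+-------------------+-------------------+
|id |date  |parsed             |combined           |
+---+------+-------------------+-------------------+
|1  |161129|2016-11-29 00:00:00|2016-11-29 00:00:00|
|2  |961209|1996-12-09 00:00:00|1996-12-09 00:00:00|
|3  |110620|2011-06-20 00:00:00|2011-06-20 00:00:00|
|4  |160928|2016-09-28 00:00:00|2016-09-28 00:00:00|
|5  |021002|2002-10-02 00:00:00|2002-10-02 00:00:00|
|6  |160421|2016-04-21 00:00:00|2016-04-21 00:00:00|
|7  |160121|2016-01-21 00:00:00|2016-01-21 00:00:00|
|8  |100903|2010-09-03 00:00:00|2010-09-03 00:00:00|
|9  |077707|null               |077707             |
+---+------+-------------------+-------------------+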

Upvotes: 3
