How to convert string to TimestampType in PySpark

I have a csv file with the below data:

Tran_id,Tran_date1,Tran_date2,Tran_date3
1,2022-07-02T16:53:30.375Z,2022-07-02T16:53:30.3750000+00:00,2022-07-02 16:53:30.3750000+00:00
2,2022-08-02T17:33:10.456Z,2022-08-02T17:33:10.4560000+00:00,2022-08-02 17:33:10.4560000+00:00
3,2022-09-02T18:13:20.375Z,2022-09-02T18:13:20.3750000+00:00,2022-09-02 18:13:20.3750000+00:00
4,2022-09-02T19:23:90.322Z,2022-09-02T19:23:90.3220000+00:00,2022-09-02 19:23:90.3220000+00:00

I want to read this csv file using PySpark and convert the data to the below format:

root
 |-- Tran_id: integer (nullable = false)
 |-- Tran_date1: timestamp (nullable = false)
 |-- Tran_date2: timestamp (nullable = false)
 |-- Tran_date3: timestamp (nullable = false)

and save this data into a Hive table, converting the string columns to TimestampType.

How do I convert the strings into TimestampType without losing the format?

Upvotes: 1

Views: 903

Answers (2)

CRAFTY DBA

Reputation: 14925

I chose to show how to solve this problem with Spark SQL.

#
# Create sample dataframe + view
#

# array of tuples - data
dat1 = [
  (1,"2022-07-02T16:53:30.375Z","2022-07-02T16:53:30.3750000+00:00","2022-07-02 16:53:30.3750000+00:00"),
  (2,"2022-08-02T17:33:10.456Z","2022-08-02T17:33:10.4560000+00:00","2022-08-02 17:33:10.4560000+00:00"),
  (3,"2022-09-02T18:13:20.375Z","2022-09-02T18:13:20.3750000+00:00","2022-09-02 18:13:20.3750000+00:00"),
  (4,"2022-09-02T19:23:90.322Z","2022-09-02T19:23:90.3220000+00:00","2022-09-02 19:23:90.3220000+00:00")
]

# array of names - columns
col1 = ["Tran_id", "Tran_date1", "Tran_date2", "Tran_date3"]

# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)

# make temp hive view
df1.createOrReplaceTempView("sample_data")

# show schema
df1.printSchema()

Instead of creating a file on my Azure Databricks storage, I decided to use an array of tuples, since this is a very simple file.
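If you do want to read the actual file, a minimal read sketch would be (the path below is hypothetical):

#
# Alternative: read the csv file directly
#

df1 = (
    spark.read
    .option("header", "true")
    .csv("/mnt/raw/tran_data.csv"))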

We can see from the printed schema that the last three fields are strings.

The code below uses Spark SQL functions to convert the information from string to timestamp.

See this link for Spark SQL functions.

https://spark.apache.org/docs/2.3.0/api/sql/index.html#to_timestamp
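Note that to_timestamp also accepts an explicit pattern when the default parsing does not match your strings. A quick sketch (the literal and pattern below are just illustrations):

#
# to_timestamp with an explicit pattern
#

spark.sql("""
  select to_timestamp('2022-07-02 16:53:30.375',
                      'yyyy-MM-dd HH:mm:ss.SSS') as ts
""").show(truncate=False)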

#
# Convert data using PySpark SQL
#

# sql stmt
stmt = """
  select
    Tran_id,
    to_timestamp(Tran_date1) as Tran_Ts1,
    to_timestamp(Tran_date2) as Tran_Ts2,
    to_timestamp(Tran_date3) as Tran_Ts3  
  from sample_data
"""

# new dataframe w/results
df2 = spark.sql(stmt)

# show schema
df2.printSchema()

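The original question also asks about saving to a Hive table. A minimal sketch in the same Spark SQL style (the table name default.tran_data_ts is hypothetical, and a Hive-enabled session is assumed):

#
# Save the converted data to a hive table
#

# expose the converted dataframe to SQL
df2.createOrReplaceTempView("converted_data")

# create a managed table from the view (name is hypothetical)
spark.sql("""
  create table if not exists default.tran_data_ts
  as select * from converted_data
""")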

I totally agree with qaziqarta.

Once the data is a timestamp, you can format it any way you want, either in the front-end reporting tool (for example, Power BI) or by converting it to a formatted string in the curated zone, ready for reporting.

See this link for the Spark SQL date_format function.

https://spark.apache.org/docs/2.3.0/api/sql/index.html#date_format

See this link for valid datetime format patterns.

https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html

The code below formats the timestamps in three different ways.

#
# Format data using Spark SQL
#

# sql stmt
stmt = """
  select
    Tran_id,
    date_format(to_timestamp(Tran_date1), "yyyy-MM-dd") as Tran_Fmt1,
    date_format(to_timestamp(Tran_date2), "yyyy-MM-dd HH:mm:ss") as Tran_Fmt2,
    date_format(to_timestamp(Tran_date3), "yyyy-MM-dd HH:mm:ss.SSS") as Tran_Fmt3
  from sample_data
"""

# new dataframe w/results
df3 = spark.sql(stmt)

# show schema
df3.printSchema()

Please note, we are now going back to a string format.

Executing the above SQL yields formatted strings representing the date, the date/timestamp, and the date/extended timestamp.
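If you prefer the DataFrame API over SQL, a sketch of the equivalent formatting (df1 is the sample dataframe created above):

#
# Same formatting via the DataFrame API
#

from pyspark.sql import functions as F

df4 = df1.select(
    "Tran_id",
    F.date_format(F.to_timestamp("Tran_date1"), "yyyy-MM-dd").alias("Tran_Fmt1"),
    F.date_format(F.to_timestamp("Tran_date2"), "yyyy-MM-dd HH:mm:ss").alias("Tran_Fmt2"),
    F.date_format(F.to_timestamp("Tran_date3"), "yyyy-MM-dd HH:mm:ss.SSS").alias("Tran_Fmt3"))

df4.printSchema()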

Upvotes: 1

qaziqarta

Reputation: 1944

You could have tried reading your csv file with automatic schema inference, like this:

df = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
df.printSchema()
root
 |-- Tran_id: integer (nullable = true)
 |-- Tran_date1: string (nullable = true)
 |-- Tran_date2: string (nullable = true)
 |-- Tran_date3: string (nullable = true)

But as you can see, with your data it won't give you the correct schema. The reason is that you have some bad data in your csv. E.g. look at your record #4: 2022-09-02T19:23:90.322Z. There can't be 90 seconds.

You can do the parsing yourself:

from pyspark.sql import functions as F

df = (
    spark.read
    .option("header", "true")
    .csv("test.csv")
    .select(
        "Tran_id",
        F.to_timestamp("Tran_date1").alias("Tran_date1"),
        F.to_timestamp("Tran_date2").alias("Tran_date2"),
        F.to_timestamp("Tran_date3").alias("Tran_date3")))

# Schema is correct now
df.printSchema()
root
 |-- Tran_id: string (nullable = true)
 |-- Tran_date1: timestamp (nullable = true)
 |-- Tran_date2: timestamp (nullable = true)
 |-- Tran_date3: timestamp (nullable = true)

# But we now have nulls for bad data
df.show(truncate=False)
+-------+-----------------------+-----------------------+-----------------------+
|Tran_id|Tran_date1             |Tran_date2             |Tran_date3             |
+-------+-----------------------+-----------------------+-----------------------+
|1      |2022-07-02 16:53:30.375|2022-07-02 16:53:30.375|2022-07-02 16:53:30.375|
|2      |2022-08-02 17:33:10.456|2022-08-02 17:33:10.456|2022-08-02 17:33:10.456|
|3      |2022-09-02 18:13:20.375|2022-09-02 18:13:20.375|2022-09-02 18:13:20.375|
|4      |null                   |null                   |null                   |
+-------+-----------------------+-----------------------+-----------------------+
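If you want to inspect which source rows produced the nulls, one sketch (reusing the F import from above, and test.csv as before):

# re-read the raw strings and keep only rows that fail to parse
raw = spark.read.option("header", "true").csv("test.csv")

bad = raw.filter(
    F.col("Tran_date1").isNotNull()
    & F.to_timestamp("Tran_date1").isNull())

bad.show(truncate=False)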

With the correct schema in place, you can later save your dataframe to Hive; Spark will take care of preserving the types.
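For example, a minimal sketch (assuming a Hive-enabled Spark session; the table name is hypothetical):

# persist the parsed dataframe as a managed table
df.write.mode("overwrite").saveAsTable("default.transactions")

# reading it back shows the timestamp types were preserved
spark.table("default.transactions").printSchema()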

Upvotes: 1
