Ian Carrick

Reputation: 358

PySpark - Create a Dataframe with timestamp column datatype

I want to create a simple dataframe using PySpark in a notebook on Azure Databricks. The dataframe only has 3 columns:

  1. TimePeriod - string
  2. StartTimeStamp - a data type such as 'timestamp', or one that can hold a time of day (no date part) in the form 'HH:MM:SS:MI'*
  3. EndTimeStamp - a data type such as 'timestamp', or one that can hold a time of day (no date part) in the form 'HH:MM:SS:MI'*

*hours:minutes:seconds:milliseconds, for example '15:59:59:59'

Example values for one row:
TimePeriod      = "00:00:00:00 -> 15:59:59:59"
StartTimeStamp  = 00:00:00:00
EndTimeStamp    = 15:59:59:59

Example values for second row:
TimePeriod      = "16:00:00:00 -> 18:59:59:59"
StartTimeStamp  = 16:00:00:00
EndTimeStamp    = 18:59:59:59

Other rows not shown

I want the timestamp columns to be stored as a 'timestamp' with milliseconds if possible - not string. The 'TimePeriod' column is fine as a string.

I am new to PySpark and need help writing code to construct this small dataframe. If possible, the answer should not use a pandas DataFrame.

Upvotes: 4

Views: 10170

Answers (1)

Hansanho

Reputation: 293

You have a few options:

Option 1: Cast the columns to timestamp:

columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
data = [("16:00:00:00 -> 18:59:59:59", '16:00:00.00', '18:59:00.00')]
df = spark.createDataFrame(data).toDF(*columns)
df = df.withColumn('StartTimeStamp', df.StartTimeStamp.cast('timestamp')) \
       .withColumn('EndTimeStamp', df.EndTimeStamp.cast('timestamp'))
df

Output: DataFrame[TimePeriod: string, StartTimeStamp: timestamp, EndTimeStamp: timestamp]

Option 2: Convert your StartTimeStamp and EndTimeStamp from string to timestamp, and create the DataFrame with the converted timestamps.

import datetime
timestamp = datetime.datetime.strptime('16:00:00:00',"%H:%M:%S:%f")
timestamp2 = datetime.datetime.strptime('18:59:59:59',"%H:%M:%S:%f")
columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
data = [("16:00:00:00 -> 18:59:59:59", timestamp, timestamp2 )]

dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2

Output: DataFrame[TimePeriod: string, StartTimeStamp: timestamp, EndTimeStamp: timestamp]
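One caveat worth knowing with the strptime pattern above (not mentioned in the original answer): %f is a microsecond field that is right-padded with zeros, so a two-digit value like the '59' in '15:59:59:59' is parsed as 590 milliseconds, not 59. A quick sketch:

```python
import datetime

# %f right-pads to six digits, so ':59' becomes 590000 microseconds (590 ms)
ts = datetime.datetime.strptime('15:59:59:59', "%H:%M:%S:%f")
print(ts.microsecond)  # 590000
```

If the last field really means milliseconds, it needs zero-padding to three digits before parsing (see the helper sketched after Option 3).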

Option 3: Define a schema and create the DataFrame based on it:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType
schema = StructType([
  StructField('TimePeriod', StringType(), True),
  StructField('StartTimeStamp', TimestampType(), True),
  StructField('EndTimeStamp', TimestampType(), True)
])
df = spark.createDataFrame(data, schema)  # `data` as built in Option 2
df

Output: DataFrame[TimePeriod: string, StartTimeStamp: timestamp, EndTimeStamp: timestamp]
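If the trailing field in 'HH:MM:SS:MI' is meant as milliseconds, a small helper (hypothetical, not from the original answer) can zero-pad it so %f reads it correctly; the resulting list of tuples can then be passed to spark.createDataFrame(data, schema) exactly as in Option 3:

```python
import datetime

def parse_hms_ms(s):
    # Treat the last ':'-separated field as milliseconds and left-pad it
    # to three digits, so strptime's %f does not misread '59' as 590 ms.
    h, m, sec, ms = s.split(':')
    return datetime.datetime.strptime(f"{h}:{m}:{sec}.{int(ms):03d}", "%H:%M:%S.%f")

# Rows matching the question's examples; usable as `data` in Option 3
data = [
    ("00:00:00:00 -> 15:59:59:59", parse_hms_ms("00:00:00:00"), parse_hms_ms("15:59:59:59")),
    ("16:00:00:00 -> 18:59:59:59", parse_hms_ms("16:00:00:00"), parse_hms_ms("18:59:59:59")),
]
```

Note the values still carry a date part (strptime defaults to 1900-01-01), since Spark's TimestampType has no time-only variant.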

Upvotes: 3
