Reputation: 358
I want to create a simple dataframe using PySpark in a notebook on Azure Databricks. The dataframe only has 3 columns:
*hours:minutes:seconds:milliseconds, example '15:59:59:59'
Example values for one row:
TimePeriod = "00:00:00:00 -> 15:59:59:59"
StartTimeStamp = 00:00:00:00
EndTimeStamp = 15:59:59:59
Example values for secondone row:
TimePeriod = "16:00:00:00 -> 18:59:59:59"
StartTimeStamp = 16:00:00:00
EndTimeStamp = 18:59:59:59
Other rows not shown
I want the timestamp columns to be stored as a 'timestamp' with milliseconds if possible - not string. The 'TimePeriod' column is fine as a string.
I am new to PySpark and need help are writing code to construct this small dataframe. If the answer can not use a Panadas DF that would be good.
Upvotes: 4
Views: 10170
Reputation: 293
You have a few Options:
Option: Cast the columns to Timestamp:
columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
data = [("16:00:00:00 -> 18:59:59:59", '16:00:00.00',
'18:59:00.00')]
df = spark.createDataFrame(data).toDF(*columns)
columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
df =
df.withColumn("StartTimeStamp",df.StartTimeStamp.cast('timestamp'))
.withColumn('EndTimeStamp',df.EndTimeStamp.cast('timestamp'))
df
Output: DataFrame[TimePeriod: string, StartTimeStamp: timestamp,
EndTimeStamp: timestamp]
Option No.2: Convert ur StartTimeStamp and EndTimeStamp from str to timestamp and create the dataFrame with the converted Timestamps.
import time
import datetime
timestamp = datetime.datetime.strptime('16:00:00:00',"%H:%M:%S:%f")
timestamp2 = datetime.datetime.strptime('18:59:59:59',"%H:%M:%S:%f")
columns = ['TimePeriod', 'StartTimeStamp', 'EndTimeStamp']
data = [("16:00:00:00 -> 18:59:59:59", timestamp, timestamp2 )]
dfFromData2 = spark.createDataFrame(data).toDF(*columns)
dfFromData2
Output: DataFrame[TimePeriod: string, StartTimeStamp: timestamp,
EndTimeStamp: timestamp]
Option No .3: Create a schema and create the DataFrame based on the schema you provided:
from pyspark.sql.types import StructType,StructField, TimestampType
schema = StructType([
StructField('TimePeriod', StringType(), True),
StructField('StartTimeStamp', TimestampType(), True),
StructField('EndTimeStamp', TimestampType(), True)
])
df = spark.createDataFrame(data,schema)
df
Output: DataFrame[TimePeriod: string, StartTimeStamp: timestamp,
EndTimeStamp: timestamp]
Upvotes: 3