Reputation: 51
I'm trying to read some tables from a SQLite db file to work with them with PySpark. I'm using the following code, and it works well with the tables which do not contain a date.
# Read a table from the SQLite file over JDBC
df = spark.read.format('jdbc') \
    .options(driver='org.sqlite.JDBC', dbtable='table_name',
             url='jdbc:sqlite:/path/to/database.db') \
    .load()
df.createOrReplaceTempView("table_name")
spark.sql("select * from table_name").show(n=5)
df.printSchema()
Output:
root
|-- id: integer (nullable = true)
|-- initial_date: date (nullable = true)
|-- final_date: date (nullable = true)
|-- ref_id: string (nullable = true)
The problem comes with tables that contain a date column: running spark.sql("select * from table_name").show(n=5) fails with a date parsing error:
Caused by: java.sql.SQLException: Error parsing date
Caused by: java.text.ParseException: Unparseable date: "2019-12-18" does not match (\p{Nd}++)\Q-\E(\p{Nd}++)\Q-\E(\p{Nd}++)\Q \E(\p{Nd}++)\Q:\E(\p{Nd}++)\Q:\E(\p{Nd}++)\Q.\E(\p{Nd}++)
How can I parse the date into a format PySpark accepts?
Upvotes: 0
Views: 576
Reputation: 51
Finally, the solution that worked was to use the customSchema option to read the date fields as String. The error above shows that the driver tries to match a full timestamp pattern (yyyy-MM-dd HH:mm:ss.SSS), while the stored values are plain yyyy-MM-dd dates, so reading them as strings sidesteps the driver's parser entirely.
df = spark.read.format('jdbc') \
    .options(driver='org.sqlite.JDBC', dbtable='table_name',
             url='jdbc:sqlite:/path/to/database.db') \
    .option("customSchema", "column1 STRING, column2 STRING") \
    .load()
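This leaves the date columns as plain text. A minimal follow-up sketch, assuming the initial_date and final_date columns from the schema above and the yyyy-MM-dd values seen in the error message, converts them back to DateType inside Spark:

from pyspark.sql.functions import to_date

# Convert the string columns back to DateType; the 'yyyy-MM-dd' pattern
# matches values like "2019-12-18" stored in the SQLite file.
df = df.withColumn('initial_date', to_date('initial_date', 'yyyy-MM-dd')) \
       .withColumn('final_date', to_date('final_date', 'yyyy-MM-dd'))

df.printSchema()  # initial_date and final_date are now date columns again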
Upvotes: 1