Daniel

Reputation: 51

Error parsing date from SQLite with PySpark

I'm trying to read some tables from a SQLite db file to work with them with PySpark. I'm using the following code, and it works well with the tables which do not contain a date.

df = spark.read.format('jdbc') \
        .options(driver='org.sqlite.JDBC', dbtable='table_name',
                 url='jdbc:sqlite:/path/to/database.db')\
        .load()

df.createOrReplaceTempView("table_name")
spark.sql("select * from table_name").show(n=5)
df.printSchema()

# Output:
root
 |-- id: integer (nullable = true)
 |-- initial_date: date (nullable = true)
 |-- final_date: date (nullable = true)
 |-- ref_id: string (nullable = true)

The problem comes when a table contains a date: executing spark.sql("select * from table_name").show(n=5) fails with a date parsing error.

Caused by: java.sql.SQLException: Error parsing date

Caused by: java.text.ParseException: Unparseable date: "2019-12-18" does not match (\p{Nd}++)\Q-\E(\p{Nd}++)\Q-\E(\p{Nd}++)\Q \E(\p{Nd}++)\Q:\E(\p{Nd}++)\Q:\E(\p{Nd}++)\Q.\E(\p{Nd}++)

How can I parse the date into a format PySpark can read?
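For context, SQLite has no native DATE storage class: a column declared DATE holds plain TEXT such as "2019-12-18", which the JDBC driver then tries to match against a full timestamp pattern. A minimal stand-in with the standard sqlite3 module (an in-memory database mirroring the schema above) shows what is actually stored:

```python
import sqlite3
from datetime import datetime

# In-memory stand-in for the real database file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, initial_date DATE)")
conn.execute("INSERT INTO table_name VALUES (1, '2019-12-18')")

# SQLite returns the DATE column as TEXT
value = conn.execute("SELECT initial_date FROM table_name").fetchone()[0]
print(type(value).__name__, value)  # str 2019-12-18

# The value parses fine with a date-only pattern; the driver's
# timestamp pattern ('yyyy-MM-dd HH:mm:ss.SSS') is what fails
parsed = datetime.strptime(value, "%Y-%m-%d").date()
print(parsed)
```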

Upvotes: 0

Views: 576

Answers (1)

Daniel

Reputation: 51

In the end, the solution that worked was to use the customSchema option to read the date fields as strings, which bypasses the driver's date parsing.

df = spark.read.format('jdbc') \
        .options(driver='org.sqlite.JDBC', dbtable='table_name',
                 url='jdbc:sqlite:/path/to/database.db')\
        .option("customSchema", "initial_date STRING, final_date STRING")\
        .load()

Upvotes: 1
