Pranab

Reputation: 41

Reading 100M rows from PostgreSQL into Spark and writing to Parquet

I am trying to read a big table (~100M rows) into Spark. The table is in PostgreSQL and we are reading it as follows:

    import java.util.Properties

    val connectionProperties = new Properties()
    connectionProperties.put("user", "$USER")
    connectionProperties.put("password", "$PASSWORD")

    // $HOST, $PORT, $DB and $TABLE are placeholders for the actual connection details
    val jdbc_url = "jdbc:postgresql://$HOST:$PORT/$DB"
    val df = spark.read.option("inferSchema", true).jdbc(jdbc_url, "$TABLE", connectionProperties)

But our SQL table has two columns of the money data type (values formatted like $100,000.23). When Spark reads them, it converts them to double and throws an exception.

We have tried:

a) Casting the column data to Double. That does not help, because Spark also casts to double automatically and then fails on the commas in the values; the data arrives in the DataFrame with the commas still in place.

b) Using a custom PostgreSQL dialect (see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala); a rough sketch of what we are attempting is shown below.
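Roughly, the dialect attempt looks like this, assuming the PostgreSQL JDBC driver reports the column's type name as "money"; the idea is to map that type to StringType so Spark reads the raw text instead of a double:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
    import org.apache.spark.sql.types._

    // Sketch only: treat PostgreSQL "money" columns as strings
    // so values like "$100,000.23" come through unparsed.
    val moneyAsStringDialect = new JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:postgresql")

      override def getCatalystType(
          sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
        if (typeName.equalsIgnoreCase("money")) Some(StringType) else None
    }

    // Registered dialects are consulted before the built-in PostgresDialect.
    JdbcDialects.registerDialect(moneyAsStringDialect)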

Any help will be really appreciated.

Upvotes: 2

Views: 2336

Answers (1)

addmeaning

Reputation: 1398

You can try to specify the schema manually, read the money column as a string, and then parse the value with a user-defined function (UDF).

To specify the schema manually, you need to write something like this:

    import org.apache.spark.sql.types._

    val schema = StructType(
      StructField("your-example-column1", IntegerType, true) ::
      StructField("your-money-column", StringType, true) :: Nil)

    spark.read.schema(schema)

See the Spark Scala API documentation for StructType and StructField.

To learn more about how to convert a StringType column to the data type you need, refer to this question.
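For example, a rough sketch of such a parsing UDF, assuming the money values come through as strings like "$100,000.23" (the column name below is just the placeholder from the schema above):

    import org.apache.spark.sql.functions.{col, udf}

    // Strip the currency symbol and thousands separators, then parse to Double.
    // Returning an Option keeps nulls as nulls instead of failing.
    val parseMoney = udf((s: String) =>
      Option(s).map(_.replaceAll("[$,]", "").toDouble))

    val parsed = df.withColumn("your-money-column",
      parseMoney(col("your-money-column")))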

Upvotes: 2
