Reputation: 3053
I'm using Spark 2.1 and am trying to read a CSV file.
compile group: 'org.scala-lang', name: 'scala-library', version: '2.11.1'
compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.0'
Here is my code.
import java.io.{BufferedWriter, File, FileWriter}
import java.sql.{Connection, DriverManager}
import net.sf.log4jdbc.sql.jdbcapi.ConnectionSpy
import org.apache.spark.sql.{DataFrame, SparkSession, Column, SQLContext}
import org.apache.spark.sql.functions._
import org.postgresql.jdbc.PgConnection
spark.read
.option("charset", "utf-8")
.option("header", "true")
.option("quote", "\"")
.option("delimiter", ",")
.csv(...)
It works well. The problem is that the Spark read (DataFrameReader) option keys are not the same as in the reference (link). The reference says I should use 'encoding' to set the encoding, but that doesn't work, while 'charset' works fine. Is the reference wrong?
Upvotes: 7
Views: 22470
Reputation: 3260
You can see it in the Spark source here:
val charset = parameters.getOrElse("encoding",
  parameters.getOrElse("charset", StandardCharsets.UTF_8.name()))
Both encoding and charset are valid options, and you should have no problem using either when setting the encoding.
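For illustration, here is a minimal sketch showing both keys in use; the path data.csv is hypothetical, and either call should produce the same DataFrame:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-encoding")
  .master("local[*]")
  .getOrCreate()

// Using the documented key:
val dfEncoding = spark.read
  .option("header", "true")
  .option("encoding", "utf-8")
  .csv("data.csv")

// Using the legacy key; it falls back to the same charset parameter:
val dfCharset = spark.read
  .option("header", "true")
  .option("charset", "utf-8")
  .csv("data.csv")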
charset is simply there for legacy support, from when the Spark CSV code came from the Databricks spark-csv project, which was merged into Spark as of 2.x. That is also where delimiter (now sep) comes from.
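The separator works the same way: both keys are accepted, with sep checked first. A small sketch, again with a hypothetical data.csv:

// Legacy key inherited from the spark-csv project:
spark.read.option("delimiter", ";").csv("data.csv")

// Current key from the Spark 2.x documentation:
spark.read.option("sep", ";").csv("data.csv")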
Note the default values for the CSV reader: since you are just using the defaults anyway, you can remove charset, quote, and delimiter from your code, leaving you with simply:
spark.read.option("header", "true").csv(...)
Upvotes: 5