Reputation: 13666
I have a CSV file which I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields have \n
within them, e.g. the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the following straightforward code. I set parserLib
to univocity
because, from what I read on the internet, it solves the multiple-newline problem, but that does not seem to be the case for me.
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("parserLib", "univocity")
    .load("data.csv");
How do I replace the newlines within quoted fields? Is there an easier way?
Upvotes: 4
Views: 11417
Reputation: 785
There is an option available to users of Spark 2.2 to account for line breaks in CSV files. It was originally discussed as being called wholeFile, but prior to release it was renamed multiLine.
Here is an example of loading a CSV into a DataFrame with that option:
val webtrends_data = (sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.option("multiLine", true)
.option("delimiter", ",")
.format("csv")
.load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
Upvotes: 7
Reputation: 5000
Upgrade to Spark 2.x. A real newline is CR/LF, represented by ASCII 13 and 10, whereas a backslash followed by 'n' is a different pair of ASCII characters that is written and interpreted programmatically. Spark 2.x will read it correctly; I tried it. See below:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HelloSpark")
  .master("local[2]")
  .getOrCreate()
val df = spark.read.csv("src/main/resources/data.csv")
df.foreach(row => println(row.mkString(", ")))
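To see the distinction the answer is drawing between a real newline and the literal characters, a quick sketch (plain Scala, no Spark needed):

val realNewline = "\n"        // one character: ASCII 10 (LF)
val literalBackslashN = "\\n" // two characters: backslash (92) and 'n' (110)
println(realNewline.length)                            // prints 1
println(literalBackslashN.length)                      // prints 2
println(literalBackslashN.map(_.toInt).mkString(","))  // prints 92,110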
If you can't upgrade, then clean up the literal \n sequences on the RDD with a string replace. This won't remove real end-of-line characters (those are what $ matches in a regex). See below:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile("src/main/resources/data.csv")
// Strip the literal two-character sequence \n from every row
val rdd2 = rdd1.map(row => row.replace("\\n", ""))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd2.toDF()
df.foreach(row => println(row.mkString(", ")))
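If you prefer an actual regex, replaceAll treats its pattern as one; a small sketch of an equivalent cleanup (the trailing \s* is an assumption that also collapses whitespace left behind by the removal):

// In a regex, the literal sequence backslash + 'n' is written \\n,
// which in Scala source becomes """\\n""" (or "\\\\n" in a plain string).
val rdd2Regex = rdd1.map(row => row.replaceAll("""\\n\s*""", ""))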
Upvotes: 1
Reputation: 74669
According to SPARK-14194 (resolved as a duplicate), fields with newline characters are not supported and never will be.
I proposed to solve this via the wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.
That's however Spark 2.0, and you use the spark-csv module.
In the referenced SPARK-19610 it was fixed with the pull request:
hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline
In other words, use the wholeFile option in Spark 2.x (as you can see in CSVDataSource).
As to spark-csv, this comment might be of some help (highlighting mine):
However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
In spark-csv's Features you can find the following:
The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:

quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.

quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes
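Putting those write options together, here is a hedged sketch with the spark-csv package (the DataFrame df and the output path are assumptions); quoteMode ALL forces quotes around every field, which keeps embedded newlines inside quotes on output:

// Sketch, assuming Spark 1.x with spark-csv on the classpath
// and an existing DataFrame named df.
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\"")       // default quote character
  .option("quoteMode", "ALL")  // quote every field
  .save("output.csv")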
Upvotes: 5