Reputation: 13666
I have a CSV file which I am trying to load using the Spark CSV package, and it does not load the data properly because a few of the fields have \n
within them, e.g. the following two rows:
"XYZ", "Test Data", "TestNew\nline", "OtherData"
"XYZ", "Test Data", "blablablabla
\nblablablablablalbal", "OtherData"
I am using the following straightforward code. I set parserLib
to univocity
because, from what I read on the internet, it solves the multiple-newline problem, but that does not seem to be the case for me.
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

SQLContext sqlContext = new SQLContext(sc);
DataFrame df = sqlContext.read()
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("parserLib", "univocity")
    .load("data.csv");
How do I replace the newlines within quoted fields? Is there an easier way?
Upvotes: 4
Views: 11417
Reputation: 785
There is an option available to users of Spark 2.2 to account for line breaks in CSV files. It was originally discussed as being called wholeFile, but prior to release it was renamed multiLine.
Here is an example of loading a CSV into a DataFrame with that option:
val webtrends_data = (sparkSession.read
.option("header", "true")
.option("inferSchema", "true")
.option("multiLine", true)
.option("delimiter", ",")
.format("csv")
.load("hdfs://hadoop-master:9000/datasource/myfile.csv"))
Upvotes: 7
Reputation: 5000
Upgrade to Spark 2.x. A real newline is CR/LF, represented by ASCII 13 and 10, whereas a backslash followed by 'n' is a different pair of ASCII characters that is written and interpreted programmatically. Spark 2.x will read it correctly; I tried it. See below:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HelloSpark")
  .master("local[2]")
  .getOrCreate()
val df = spark.read.csv("src/main/resources/data.csv")
df.foreach(row => println(row.mkString(", ")))
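To see the distinction the answer is drawing between a real newline and the literal characters, a quick sketch (plain Scala, no Spark needed):

val realNewline = "\n"        // one character: ASCII 10 (LF)
val literalBackslashN = "\\n" // two characters: backslash (92) and 'n' (110)
println(realNewline.length)                            // prints 1
println(literalBackslashN.length)                      // prints 2
println(literalBackslashN.map(_.toInt).mkString(","))  // prints 92,110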
If you can't upgrade, then clean up the literal \n sequences on the RDD with a string replace. This won't remove real end-of-line characters (those are what $ matches in a regex). See below:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("HelloSpark").setMaster("local")
val sc = new SparkContext(conf)
val rdd1 = sc.textFile("src/main/resources/data.csv")
// Strip the literal two-character sequence \n from every row
val rdd2 = rdd1.map(row => row.replace("\\n", ""))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = rdd2.toDF()
df.foreach(row => println(row.mkString(", ")))
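If you prefer an actual regex, replaceAll treats its pattern as one; a small sketch of an equivalent cleanup (the trailing \s* is an assumption that also collapses whitespace left behind by the removal):

// In a regex, the literal sequence backslash + 'n' is written \\n,
// which in Scala source becomes """\\n""" (or "\\\\n" in a plain string).
val rdd2Regex = rdd1.map(row => row.replaceAll("""\\n\s*""", ""))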
Upvotes: 1
Reputation: 74669
According to SPARK-14194 (resolved as a duplicate), fields with newline characters are not supported and never will be.
I proposed to solve this via the wholeFile option and it seems merged. I am resolving this as a duplicate of that as that one has a PR.
That's however Spark 2.0, and you use the spark-csv module.
In the referenced SPARK-19610 it was fixed with the pull request:
hmm, I understand the motivation for this, though my understanding with csv generally either avoid having newline in field or some implementation would require quotes around field value with newline
In other words, use the wholeFile option in Spark 2.x (as you can see in CSVDataSource).
As to spark-csv, this comment might be of some help (highlighting mine):
However, that there are a quite bit of similar JIRAs complaining about this and the original CSV datasource tried to support this although that was incorrectly implemented. This tries to match it with JSON one at least and it might be better to provide a way to process such CSV files. Actually, current implementation requires quotes :). (It was told R supports this case too actually).
In spark-csv's Features you can find the following:
The package also supports saving simple (non-nested) DataFrame. When writing files the API accepts several options:

quote: by default the quote character is ", but can be set to any character. This is written according to quoteMode.

quoteMode: when to quote fields (ALL, MINIMAL (default), NON_NUMERIC, NONE), see Quote Modes
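Putting those write options together, here is a hedged sketch with the spark-csv package (the DataFrame df and the output path are assumptions); quoteMode ALL forces quotes around every field, which keeps embedded newlines inside quotes on output:

// Sketch, assuming Spark 1.x with spark-csv on the classpath
// and an existing DataFrame named df.
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quote", "\"")       // default quote character
  .option("quoteMode", "ALL")  // quote every field
  .save("output.csv")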
Upvotes: 5