Spark: read a file with a multi-character delimiter and the multiLine option

Question

How can I read a file that uses a multi-character delimiter together with the multiLine option in Spark 3.0.1?

Input file

company||street||city
Test1 company||1st street||city1
Test2 company||2nd street||city2
Test3 company||"3rd
 street"||city3

val df = spark.read
        .option("delimiter", "||")
        .option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "false")
        .csv(transformedFile)

Counting the DataFrame reports 4 records instead of 3, and selecting only the company column shows a stray fourth row:

records count :4
+-------------+
|company      |
+-------------+
|Test1 company|
|Test2 company|
|Test3 company|
|street"      |
+-------------+

Showing all columns, by contrast, prints the three expected rows:

+-------------+-----------+-----+
|company      |street     |city |
+-------------+-----------+-----+
|Test1 company|1st street |city1|
|Test2 company|2nd street |city2|
|Test3 company|3rd 
street|city3|
+-------------+-----------+-----+

The same file is read as expected when it uses a single-character delimiter.

mck · Accepted Answer

You can cache the DataFrame to make sure that it is read properly:

val df = spark.read.option("delimiter", "||")
        .option("header", "true")
        .option("multiLine", "true")
        .option("inferSchema", "false")
        .csv(transformedFile)

df.cache

df.select("company").show
+-------------+
|      company|
+-------------+
|Test1 company|
|Test2 company|
|Test3 company|
+-------------+

df.count
// 3
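
For anyone who wants to reproduce this end to end, here is a minimal sketch that writes the sample input from the question to a temporary file and applies the cache approach above. The object name MultiCharDelimiterCheck, the helper readCached, and the temporary file (standing in for transformedFile) are made up for illustration. The thread does not explain why caching helps; one plausible reading, stated here only as an assumption, is that a cached DataFrame is parsed with all columns, so count and single-column selects no longer hand a reduced column set to the CSV parser.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiCharDelimiterCheck {
  // Sample input copied from the question; the third record spans two lines.
  val sample: String =
    "company||street||city\n" +
    "Test1 company||1st street||city1\n" +
    "Test2 company||2nd street||city2\n" +
    "Test3 company||\"3rd\n street\"||city3\n"

  // Hypothetical helper: same read options as the question, followed by the
  // cache call suggested in the accepted answer.
  def readCached(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("delimiter", "||")
      .option("header", "true")
      .option("multiLine", "true")
      .option("inferSchema", "false")
      .csv(path)
      .cache()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("multi-char-delimiter-check")
      .getOrCreate()

    // Temporary file standing in for transformedFile (an assumption).
    val path = Files.createTempFile("companies", ".csv")
    Files.write(path, sample.getBytes(StandardCharsets.UTF_8))

    val df = readCached(spark, path.toUri.toString)
    df.select("company").show()
    println(s"records count: ${df.count()}")

    spark.stop()
  }
}
```

If the count printed here still differs from the number of rows shown by df.show on your Spark version, that would suggest caching is not enough in your setup, and the single-character delimiter noted in the question remains the fallback.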
