Reputation: 137
How to read a file with a multi-character delimiter and the multiLine option in Spark 3.0.1?
Input file
company||street||city
Test1 company||1st street||city1
Test2 company||2nd street||city2
Test3 company||"3rd
street"||city3
spark.read
.option("delimiter", "||")
.option("header", "true")
.option("multiLine", "true")
.option("inferSchema", "false")
.csv(transformedFile)
Printing the dataframe shows a record count of 4 instead of 3:
records count: 4
+-------------+
|company |
+-------------+
|Test1 company|
|Test2 company|
|Test3 company|
|street" |
+-------------+
Expected output:
+-------------+-----------+-----+
|company |street |city |
+-------------+-----------+-----+
|Test1 company|1st street |city1|
|Test2 company|2nd street |city2|
|Test3 company|3rd
street|city3|
+-------------+-----------+-----+
The same code works as expected with a single-character delimiter.
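For reference, here is a quote-aware scan of the sample data in plain Scala (no Spark; `QuoteAwareCount` and `countRecords` are hypothetical names for this sketch) showing why 3 records are expected: the line break inside `"3rd\nstreet"` sits within a quoted field, so it is not a record boundary.

```scala
// Count logical CSV records, treating a newline inside double quotes
// as part of a field rather than a record separator.
// Plain Scala sketch, no Spark; names here are illustrative only.
object QuoteAwareCount {
  def countRecords(text: String): Int = {
    var inQuotes = false  // are we currently inside a quoted field?
    var records  = 0
    var sawChar  = false  // has the current record any content yet?
    for (c <- text) c match {
      case '"'               => inQuotes = !inQuotes; sawChar = true
      case '\n' if !inQuotes => if (sawChar) records += 1; sawChar = false
      case _                 => sawChar = true // includes newlines inside quotes
    }
    if (sawChar) records += 1 // final record without a trailing newline
    records
  }

  def main(args: Array[String]): Unit = {
    val sample =
      "company||street||city\n" +
      "Test1 company||1st street||city1\n" +
      "Test2 company||2nd street||city2\n" +
      "Test3 company||\"3rd\nstreet\"||city3\n"
    // 4 physical data lines after the header, but only 3 logical records
    println(countRecords(sample) - 1) // subtract the header row; prints 3
  }
}
```

This is the record-splitting behavior the `multiLine` option is meant to provide; the question is why it misfires when combined with a multi-character delimiter.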
Upvotes: 1
Views: 553
Reputation: 42422
You can cache the dataframe to make sure that it's read properly:
val df = spark.read.option("delimiter", "||")
.option("header", "true")
.option("multiLine", "true")
.option("inferSchema", "false")
.csv(transformedFile)
df.cache
df.select("company").show
+-------------+
| company|
+-------------+
|Test1 company|
|Test2 company|
|Test3 company|
+-------------+
df.count
// 3
Upvotes: 1