Multiple Character Comment String Spark CSV Reader

I have a tab-delimited file in which comment lines are marked with ##. I would like to read the file into a DataFrame using something like:

val targetDF = sparkSession.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("comment", "##")
  .load(pathToFile)

When I try this, I get a runtime exception: java.lang.RuntimeException: comment cannot be more than one character. What is the best way to deal with this?

Upvotes: 1

Answers (1)

cheseaux

Reputation: 5315

Then use just a single '#': every line starting with '#' will be treated as a comment. This is what the API documentation says:

comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.

But be sure that no valid line in your file starts with this character.

val targetDF = sparkSession.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("comment", "#")
  .load(pathToFile)

Edit: because your records can contain a single '#', you'll have to omit the comment option and either filter your DataFrame manually afterwards or remove any line starting with '##' from your file before parsing it.
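
One way to do the "remove before parsing" variant inside Spark itself is sketched below. It reads the file as plain text, drops the lines that start with the prefix, and hands the remaining lines to the CSV parser. This is only a sketch: readSkippingComments is a hypothetical helper name, not part of Spark, and it assumes Spark 2.2 or later, where DataFrameReader.csv accepts a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical helper (not a Spark API): parses a tab-delimited file while
// skipping every line that starts with a multi-character comment prefix.
def readSkippingComments(
    spark: SparkSession,
    path: String,
    prefix: String = "##"): DataFrame = {
  // Read the raw lines first so the filter runs before any CSV parsing.
  val lines: Dataset[String] = spark.read.textFile(path)

  // Drop comment lines. Records that begin with a single '#' are kept.
  val dataLines: Dataset[String] =
    lines.filter((line: String) => !line.startsWith(prefix))

  // Parse what is left with the same options used in the question.
  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .option("delimiter", "\t")
    .csv(dataLines)
}

// Usage, with the names from the question:
// val targetDF = readSkippingComments(sparkSession, pathToFile)
```

Filtering before parsing, rather than on the finished DataFrame, means that comment lines at the top of the file are not mistaken for the header and do not influence schema inference.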

Upvotes: 2
