Nate Parke

Reputation: 291

Multiple Character Comment String Spark CSV Reader

I have a tab-delimited file in which comment lines are marked with ##. I would like to read the file into a DataFrame, using something like:

val targetDF = sparkSession.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")
  .option("comment", "##")
  .load(pathToFile)

When I try this, I get a runtime exception: java.lang.RuntimeException: comment cannot be more than one character. What is the best way to deal with this?

Upvotes: 1

Views: 2441

Answers (1)

cheseaux

Reputation: 5315

Use a single '#' instead; every line starting with '#' will then be treated as a comment. This is what the API documentation says:

comment (default empty string): sets the single character used for skipping lines beginning with this character. By default, it is disabled.

Just make sure that no valid line in your file starts with this character.

val targetDF = sparkSession.read.format("csv")
 .option("header", "true")
 .option("inferSchema", "true")
 .option("delimiter", "\t")
 .option("comment", "#")
 .load(pathToFile)

Edit: because your records can contain a single '#', you'll have to omit the comment option and either filter the DataFrame manually afterwards or remove every line starting with '##' from the file before parsing it.
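
One way to do the "remove the lines before parsing" variant without touching the file on disk is to read it as plain text, drop the '##' lines, and hand the remaining lines to the CSV reader. The following is only a sketch: the object name CommentFilterExample and the helpers isCommentLine and readWithoutComments are made up for illustration, and it assumes a Spark version that has the csv(Dataset[String]) overload on DataFrameReader (2.2 or later, as far as I know).

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object CommentFilterExample {
  // Hypothetical helper: true only for lines that begin with the
  // two-character marker "##"; a line starting with a single '#' is kept.
  def isCommentLine(line: String): Boolean = line.startsWith("##")

  // Hypothetical helper: reads the file as raw text, drops comment lines,
  // then parses what is left as tab-delimited CSV.
  def readWithoutComments(spark: SparkSession, pathToFile: String): DataFrame = {
    // Each element of this Dataset is one raw line of the file.
    val lines: Dataset[String] = spark.read.textFile(pathToFile)

    // Remove the comment lines before the CSV parser ever sees them.
    val dataLines: Dataset[String] = lines.filter(line => !isCommentLine(line))

    // Same options as in the question, minus "comment".
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("delimiter", "\t")
      .csv(dataLines)
  }
}
```

You would then call it as val targetDF = CommentFilterExample.readWithoutComments(sparkSession, pathToFile). Because the filter runs before parsing, this should also cope with '##' lines that appear above the header row.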

Upvotes: 2
