Rochit Aggarwal
Rochit Aggarwal

Reputation: 3

How to add double quotes to all non-null values and also not on headings in spark Java

I want to write the dataset in .csv file using spark java and the data of dataset should be in double quotes which are not null and also the headings in file which are actually field names of model class should not contain double quotes and values in file are separted by pipe delimiter(|)

example -> data should be appeared to file in below given format

NAME|ID|MOBILE_NUMBER|DEGREE|SALUTATION|LASTNAME|MAIL|DATE_OF_BIRTH "SMITH"|"56354"|"46535353"||"MR LOG"|"LOG"||"2013-10-24T11:04:52+00:00"

1st line is Heading which are field names of class and 2nd line is data

Here data of DEGREE and MAIL fields are null so they should not be double quoted also headings should not be double quoted

I tried this but not working in case of null values and headings for double quotes

   dataset.toDF().selectExpr(NAME,ID,MOBILE,DEGREE,SALUTATION,LASTNAME,MAIL,DATE_OF_BIRTH)
        .write().format("csv")
        .option("delimiter", "|"),
        .option("quotesAll",true),
        .option("header", true)
        .option("nullValue", null)
        .mode(SaveMode.Overwrite)
        .save(path);

Upvotes: 0

Views: 833

Answers (1)

Belwal
Belwal

Reputation: 473

Option quoteAll will quote everything (including header). Class CSVOptions has no such option for custom formatting.

You can write a custom expression to quote not-null values and then save it.

Scala:

val quote = lit("\"")

  //quote value if not null otherwise return.
  def quoteColumn(c: Column) = when(c.isNotNull, concat(quote, c.cast(StringType), quote )) otherwise c.cast(StringType)

  //call quoteColumn to add quote in all column values in the DataFrame
  val resDF = df.select(df.columns.map(a => quoteColumn(col(a)).as(a)):_*)


  resDF.write()
    .format("csv")
    .option("delimiter", "|")
    .option("header", true)
    //set quote to empty, no need to set null
    .option("quote", "")
    .mode(SaveMode.Overwrite)
    .save(path);

Java:

   private Column quoteColumn(Column c) {
        Column quote = lit("\"");
        return when(c.isNotNull(), concat(quote, c.cast(StringType), quote)).otherwise(c.cast(StringType));
    }
   Dataset<Row> resDF = df.select(Arrays.stream(df.columns()).map(a -> quoteColumn(col(a)).as(a)).toArray(Column[]::new));

   resDF.write()
      .format("csv")
      .option("delimiter", "|")
      .option("header", true)
      // set quote to empty
      .option("quote", "")
      // set null for null values
      .option("nullValue", null)
      .mode(SaveMode.Overwrite)
      .save(path);

Upvotes: 1

Related Questions