Andrew St P

Reputation: 584

Is there a way to programmatically set a dataset's schema from a .csv?

As an example, I have a .csv in the Excel dialect, which escapes a quote inside a quoted field by doubling it (the behaviour that Python's csv module calls doublequote).

For example, consider the row below:

"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50

I would want the schema to be inferred so that the row is then parsed as:

 [
   'XX "YYYYYYYY", ZZZZZZ "QQQQQQ"',
   'JJJJ "MMMM", RRRR "TTTT"',
   1234,
   'RRRR',
   60,
   50
 ]

Is there a way to set the schema of a dataset in a programmatic/automated fashion?

Upvotes: 1

Views: 382

Answers (2)

Jonathan Ringstad

Reputation: 967

While you can do this in code, Foundry's dataset app can also do it natively. That means you can skip writing the code, and you can potentially drop a step from your pipeline, which may save runtime.

After uploading the files to a dataset, press "edit schema" on the dataset:

[Screenshot: edit schema button]

Then apply settings like the following, which would result in the desired outcome in your case:

[Screenshot: schema editor]

Then press "save and validate" and the dataset should end up with the correct schema:

[Screenshot: final dataset]

Upvotes: 3

Adil B

Reputation: 16778

Starting with this example:

Dataset<Row> dataset = files
        .sparkSession()
        .read()
        .option("inferSchema", "true")
        .csv(csvDataset);

output.getDataFrameWriter(dataset).write();

Add the header, quote, and escape options, like so:

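Dataset<Row> dataset = files
        .sparkSession()
        .read()
        // Let Spark guess column types (e.g. integers for 1234, 60, 50).
        .option("inferSchema", "true")
        // Treat the first line of the file as column names.
        .option("header", "true")
        // Fields are wrapped in double quotes...
        .option("quote", "\"")
        // ...and a quote inside a field is escaped by another quote
        // (Spark's default escape character is a backslash).
        .option("escape", "\"")
        .csv(csvDataset);

output.getDataFrameWriter(dataset).write();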
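
To see what setting both quote and escape to a double quote means for the row in the question, here is a minimal sketch in plain Java with no Spark or Foundry dependency. The class DoubledQuoteCsv and its parseLine method are hypothetical names made up for illustration. This is not how Spark implements its CSV reader, and it ignores cases such as newlines inside quoted fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spark or Foundry: splits one CSV line where
// fields may be wrapped in double quotes and a literal quote is written as "".
class DoubledQuoteCsv {

    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes one literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // a lone quote closes the field
                    }
                } else {
                    current.append(c);       // commas inside quotes are data
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote of a quoted field
            } else if (c == ',') {
                fields.add(current.toString()); // unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // The sample row from the question, written as a Java string literal.
        String row = "\"XX \"\"YYYYYYYY\"\", ZZZZZZ \"\"QQQQQQ\"\"\","
                + "\"JJJJ \"\"MMMM\"\", RRRR \"\"TTTT\"\"\",1234,RRRR,60,50";
        for (String field : parseLine(row)) {
            System.out.println(field);
        }
    }
}
```

Running main prints six fields, one per line, matching the six values listed in the question. Every field comes back as a string here; turning 1234, 60 and 50 into numeric columns is a separate step, which is what the inferSchema option asks Spark to do.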

Upvotes: 0
