Is there a way to programmatically set a dataset's schema from a .csv?

As an example, I have a .csv in the Excel dialect, which escapes a quote inside a quoted field by doubling it (like the doublequote option in Python's csv module).

For example, consider the row below:

"XX ""YYYYYYYY"", ZZZZZZ ""QQQQQQ""","JJJJ ""MMMM"", RRRR ""TTTT""",1234,RRRR,60,50

I would want the schema to then become:

 [
   'XX "YYYYYYYY", ZZZZZZ "QQQQQQ"',
   'JJJJ "MMMM", RRRR "TTTT"',
   1234,
   'RRRR',
   60,
   50
 ]

Is there a way to set the schema of a dataset in a programmatic/automated fashion?

Upvotes: 1

Answers (2)

Jonathan Ringstad

Reputation: 967

While you can do this in code, Foundry's dataset app can also do this natively. This means you can skip writing the code (which is nice), and it also means you can potentially save a step in your pipeline (which might save you some runtime).

After uploading the files to a dataset, press "edit schema" on the dataset:

Then apply settings like the following, which would result in the desired outcome in your case:

Then press "save and validate" and the dataset should end up with the correct schema:

Upvotes: 3

Adil B

Reputation: 16856

Starting with this example:

Dataset<Row> dataset = files
        .sparkSession()
        .read()
        .option("inferSchema", "true")
        .csv(csvDataset);

output.getDataFrameWriter(dataset).write();

Add the header, quote, and escape options, like so:

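Dataset<Row> dataset = files
        .sparkSession()
        .read()
        .option("inferSchema", "true")  // infer column types from the data
        .option("header", "true")       // treat the first row as column names
        .option("quote", "\"")          // fields may be wrapped in double quotes
        .option("escape", "\"")         // "" inside a quoted field is a literal quote
        .csv(csvDataset);

output.getDataFrameWriter(dataset).write();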
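
To see why setting both quote and escape to `"` matches the Excel dialect, here is a minimal plain-Java sketch of the doubled-quote rule applied to the row from the question. It is only an illustration, not Spark's or Foundry's parser: the class name `DoubledQuoteCsv` and the method `parseLine` are hypothetical, and the sketch assumes a single line with no embedded newlines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Spark or Foundry.
final class DoubledQuoteCsv {

    // Splits one CSV line into fields, treating "" inside a quoted field
    // as a single literal quote character.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote: emit one quote
                        i++;                 // and skip its twin
                    } else {
                        inQuotes = false;    // lone quote closes the field
                    }
                } else {
                    current.append(c);       // commas are literal inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote of a quoted field
            } else if (c == ',') {
                fields.add(current.toString()); // unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        String row = "\"XX \"\"YYYYYYYY\"\", ZZZZZZ \"\"QQQQQQ\"\"\","
                + "\"JJJJ \"\"MMMM\"\", RRRR \"\"TTTT\"\"\",1234,RRRR,60,50";
        for (String field : parseLine(row)) {
            System.out.println(field);
        }
    }
}
```

Every field comes back as a string here; turning 1234, 60, and 50 into integers is what the inferSchema option does on the Spark side.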

Upvotes: 0
