QuickSilver

Reputation: 4045

Merge multiple JSON files into a single JSON file and a Parquet file

Source S3 location with 100s of JSON files.

  1. All JSON files need to be combined into a single JSON file, i.e. a non part-0000... file.
  2. The single output JSON file needs to replace all of these files at the source S3 location.
  3. The same JSON data needs to be converted to Parquet and saved to another S3 location.

Is there any better option than the design below? (A rough sketch of this flow follows the list.)

  1. Load the JSON files into a DataFrame.
  2. Save it to local disk.
  3. Upload the combined JSON file to S3.
  4. Clean up the rest of the S3 files after the combined file is uploaded successfully, using the AWS SDK client API.
  5. In parallel with 4, save the Parquet file to the Parquet S3 location via the DataFrame API.
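
For reference, a minimal sketch of that flow, assuming Spark with the s3a connector and the AWS SDK v1 client; all paths, bucket names and keys below are hypothetical placeholders:

import java.io.File

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.JavaConverters._

val spark = SparkSession.builder().appName("merge-json").getOrCreate()

val sourcePath   = "s3a://my-bucket/input/json/"      // hypothetical source location
val localDir     = "/tmp/combined-json"               // hypothetical local staging dir
val parquetPath  = "s3a://my-bucket/output/parquet/"  // hypothetical Parquet target
val bucketName   = "my-bucket"                        // hypothetical bucket
val sourcePrefix = "input/json/"                      // hypothetical source prefix
val combinedKey  = "input/json/combined.json"         // hypothetical combined file key

// 1. Load all source JSON files into one DataFrame (cached so S3 is scanned only once)
val df = spark.read.json(sourcePath).cache()

// 2. Save a single combined JSON file on local disk (coalesce(1) => one part file)
df.coalesce(1).write.mode(SaveMode.Overwrite).json(s"file://$localDir")
val combinedFile = new File(localDir).listFiles().find(_.getName.startsWith("part-")).get

// 3 + 4. Upload the combined file, then delete the original objects via the AWS SDK
val s3 = AmazonS3ClientBuilder.defaultClient()
val originalKeys = s3.listObjects(bucketName, sourcePrefix)
  .getObjectSummaries.asScala.map(_.getKey).toList
s3.putObject(bucketName, combinedKey, combinedFile)
originalKeys.filterNot(_ == combinedKey).foreach(key => s3.deleteObject(bucketName, key))

// 5. Write the same data as Parquet to the other S3 location (can run concurrently with 4)
df.write.mode(SaveMode.Overwrite).parquet(parquetPath)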

I had the below question on the above design.

Upvotes: 3

Views: 4171

Answers (2)

QuickSilver

Reputation: 4045

import java.io.File
import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SaveMode

// 1. Combine all source JSON files into a single JSON file at a temporary location
spark.read
  .json(sourcePath)
  .coalesce(1)
  .write
  .mode(SaveMode.Overwrite)
  .json(tempTarget1)

// 2. Delete the original JSON files from the source location
val fs = FileSystem.get(new URI(s"s3a://$bucketName"), sc.hadoopConfiguration)

val deleted = fs.delete(new Path(sourcePath + File.separator), true)
logger.info(s"S3 folder path deleted=${deleted} sparkUuid=$sparkUuid path=${sourcePath}")

// 3. Move the combined file from the temporary location to the source location
val renamed = fs.rename(new Path(tempTarget1), new Path(sourcePath))

Tried and failed:

  1. DataFrame caching/persist did not work: whenever I tried to write, cachedDf.write went back and checked the S3 files, which I had already cleaned up manually before the write (see the sketch after this list).
  2. Writing the DataFrame directly to the same S3 directory does not work either, as the DataFrame only overwrites the partitioned files, i.e. the files starting with 'part-00...'.
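
As an illustration of point 1, here is a minimal sketch of the failing pattern (paths are hypothetical): cache() is lazy, so nothing is materialized before the source objects are removed, and the later write re-reads the now-missing files.

// Sketch of the failing pattern described above (hypothetical paths).
val cachedDf = spark.read.json("s3a://my-bucket/input/json/").cache() // lazy: nothing is read yet

// ... source objects under s3a://my-bucket/input/json/ are deleted here ...

// The write triggers the first real scan, re-reads the deleted source files,
// and fails with FileNotFoundException.
cachedDf.write.mode(SaveMode.Overwrite).json("s3a://my-bucket/input/json/")

// Forcing an action before the cleanup (e.g. cachedDf.count() or a small show())
// materializes the cache first, which is what the other answer's update relies on.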

Upvotes: 0

Ram Ghadiyaram

Reputation: 29237

Yes, it's possible to skip #2. Writing to the same location you read from can be done with SaveMode.Overwrite.

When you first read the JSON (i.e. #1) as a DataFrame, it will be in memory if you cache it. After that you can do a cleanup, combine all the JSON into one with union, and store it as a Parquet file in a single step. Something like this example:
Case 1: all JSONs are in different folders and you want to store the final DataFrame as Parquet in the same location where the JSONs are...

val dfpath1 = spark.read.json("path1")
val dfpath2 = spark.read.json("path2")
val dfpath3 = spark.read.json("path3")

// cleanup1/cleanup2/cleanup3 are your own functions of type DataFrame => DataFrame
val df1 = cleanup1(dfpath1)
val df2 = cleanup2(dfpath2)
val df3 = cleanup3(dfpath3)

val dfs = Seq(df1, df2, df3)
val finaldf = dfs.reduce(_ union _) // all DataFrames must have the same schema for union

finaldf.write.mode(SaveMode.Overwrite).parquet("final_file with samelocations json.parquet")

Case 2: all JSONs are in the same folder and you want to store the final DataFrame as multiple Parquet files in the same root location where the JSONs are...

In this case there is no need to read multiple DataFrames; you can give the root path where the JSONs (all with the same schema) are:

val dfpath1 = spark.read.json("rootpathofyourjsons with same schema")
// or you can give multiple paths: spark.read.json("path1", "path2", "path3"),
// since the Spark DataFrame reader supports it: def json(paths: String*)

// cleanup1 is your own function of type DataFrame => DataFrame
val finaldf = cleanup1(dfpath1)
finaldf.write.mode(SaveMode.Overwrite).parquet("final_file with sameroot locations json.parquet")

AFAIK, in either case the AWS S3 SDK API is no longer required.

UPDATE: Regarding the FileNotFoundException you are facing... see the code example below for how to do it. I quoted the same example you showed me here.

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDS / toDF

val df = Seq((1, 10), (2, 20), (3, 30)).toDS.toDF("sex", "date")

df.show(false)

df.repartition(1).write.format("parquet").mode("overwrite").save(".../temp") // save it
val df1 = spark.read.format("parquet").load(".../temp") // read it back again

val df2 = df1.withColumn("cleanup", lit("Quick silver want to cleanup")) // like you said you want to clean it

// THE NEXT 2 LINES ARE IMPORTANT: `cache` plus a light action such as `show`
// materializes the data; without them a FileNotFoundException will occur.
df2.cache // cache to avoid FileNotFoundException
df2.show(2, false) // light action to avoid FileNotFoundException
// or println(df2.count) // action

df2.repartition(1).write.format("parquet").mode("overwrite").save(".../temp")
println("quick silver saved in same directory where he read it from final records he saved after clean up are  ")
df2.show(false)

Result :

+---+----+
|sex|date|
+---+----+
|1  |10  |
|2  |20  |
|3  |30  |
+---+----+

+---+----+----------------------------+
|sex|date|cleanup                     |
+---+----+----------------------------+
|1  |10  |Quick silver want to cleanup|
|2  |20  |Quick silver want to cleanup|
+---+----+----------------------------+
only showing top 2 rows

quick silver saved in same directory where he read it from final records he saved after clean up are  
+---+----+----------------------------+
|sex|date|cleanup                     |
+---+----+----------------------------+
|1  |10  |Quick silver want to cleanup|
|2  |20  |Quick silver want to cleanup|
|3  |30  |Quick silver want to cleanup|
+---+----+----------------------------+


Screenshot of the file being saved, read back, cleaned up, and saved again:


Note: you need to implement Case 1 or Case 2 as suggested in the update above...

Upvotes: 1
