Reputation: 30819
I am writing a Spark job that reads data from a JSON file and writes it to a Parquet file; below is example code:
import org.apache.spark.sql.SaveMode;
import static org.apache.spark.sql.functions.*; // year, month, to_date, unix_timestamp

// "yyyy" (calendar year, not "YYYY" week year) and "HH" (24-hour clock, not "hh") for ISO timestamps
DataFrame dataFrame = sqlContext.read().json(textFile);
dataFrame = dataFrame.withColumn("year", year(to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"))));
dataFrame = dataFrame.withColumn("month", month(to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"))));
dataFrame.write().mode(SaveMode.Append).partitionBy("year", "month").parquet("<some_path>");
The JSON file consists of many JSON records, and I want a record to be updated in Parquet if it already exists. I have tried Append
mode, but it seems to work at the file level rather than the record level (i.e., if the files already exist, it simply appends to them), so running this job twice on the same input duplicates the records.
Is there any way to designate a DataFrame column as a unique key and have Spark update the record if it already exists? All the save modes seem to operate on files rather than on individual records.
Upvotes: 3
Views: 10278
Reputation: 21
You can also look at Apache Hudi (https://hudi.apache.org/), which provides support for updates (upserts) over Parquet files.
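For illustration, here is a minimal upsert sketch against Hudi's Spark datasource (assumptions: Spark 2+, an "id" column as the record key, and placeholder table/path names; check the Hudi docs for your version's exact options):

Dataset<Row> df = spark.read().json("<json_path>");
df.write().format("hudi")
    .option("hoodie.table.name", "events")                       // placeholder table name
    .option("hoodie.datasource.write.recordkey.field", "id")     // unique-key column (assumption)
    .option("hoodie.datasource.write.precombine.field", "date")  // newest record wins on key collision
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)                                       // Append + upsert updates existing keys
    .save("<some_path>");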
Upvotes: 0
Reputation: 22671
You could have a look at the Apache ORC file format instead, which supports ACID operations:
https://orc.apache.org/docs/acid.html
Depending on your use case, HBase is another option if you want to stay on top of HDFS.
But keep in mind that HDFS is a write-once file system; if that does not fit your needs, go for something else (maybe Elasticsearch or MongoDB).
Otherwise, in HDFS, you must create new files every time: set up an incremental process that builds a "delta" file, then merge OLD + DELTA = NEW_DATA, as sketched below.
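A minimal sketch of that merge in Spark, assuming the records carry a unique "id" column and that OLD and DELTA share the same schema (paths are placeholders):

import static org.apache.spark.sql.functions.col;

DataFrame old = sqlContext.read().parquet("<old_path>");   // previously written data
DataFrame delta = sqlContext.read().json("<delta_path>");  // new and updated records
// keep only the old rows whose id does NOT appear in the delta...
DataFrame kept = old.join(delta.select(col("id").as("delta_id")),
        old.col("id").equalTo(col("delta_id")), "left_outer")
    .where(col("delta_id").isNull())
    .drop("delta_id");
// ...then add the delta rows and write everything to a NEW location
kept.unionAll(delta).write().mode(SaveMode.Overwrite).parquet("<new_path>");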
Upvotes: 0
Reputation: 2345
Parquet is a file format rather than a database. To achieve an update by id, you will need to read the file, update the value in memory, then re-write the data to a new file (or overwrite the existing file).
You might be better served by a database if this is a use case that will occur frequently.
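For a one-off update, a sketch of that read-update-rewrite cycle (the "id"/"value" columns and the matched value are hypothetical; because Spark reads lazily, write to a new location rather than overwriting the path you are still reading from):

import static org.apache.spark.sql.functions.*;

DataFrame existing = sqlContext.read().parquet("<some_path>");
// hypothetical in-memory update: replace "value" for the record whose id is 42
DataFrame updated = existing.withColumn("value",
    when(col("id").equalTo(42), lit("new-value")).otherwise(col("value")));
updated.write().mode(SaveMode.Overwrite).parquet("<new_path>"); // new path, not the input path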
Upvotes: 1