Darshan Mehta

Reputation: 30819

Spark - Update the record (in parquet file) if already exists

I am writing a Spark job to read data from a JSON file and write it to a Parquet file; below is the example code:

    DataFrame dataFrame = new DataFrameReader(sqlContext).json(textFile);
    // Use "yyyy" (calendar year) and "HH" (24-hour clock) rather than "YYYY"/"hh",
    // which mean week-based year and 12-hour clock and can silently mis-parse dates.
    Column date = to_date(unix_timestamp(dataFrame.col("date"), "yyyy-MM-dd'T'HH:mm:ss.SSS").cast("timestamp"));
    dataFrame = dataFrame.withColumn("year", year(date)).withColumn("month", month(date));
    dataFrame.write().mode(SaveMode.Append).partitionBy("year", "month").parquet("<some_path>");

The JSON file consists of many records, and I want a record to be updated in the Parquet output if it already exists. I have tried Append mode, but it seems to work at the file level rather than the record level (i.e. if the file already exists, it appends to the end). So running this job twice for the same file duplicates the records.

Is there any way to designate a dataframe column as a unique key and ask Spark to update the record if it already exists? All the save modes seem to check the files, not the records.

Upvotes: 3

Views: 10278

Answers (3)

Atzmon Hen-tov

Reputation: 21

You can also look at Apache Hudi (https://hudi.apache.org/), which provides support for updates (upserts) over Parquet files.

Upvotes: 0

Thomas Decaux

Reputation: 22671

You can have a look at the Apache ORC file format instead, which supports ACID transactions; see:

https://orc.apache.org/docs/acid.html

Depending on your use case, HBase is another option if you want to stay on top of HDFS.

But keep in mind that HDFS is a write-once file system; if that does not fit your needs, go for something else (maybe Elasticsearch or MongoDB).

Otherwise, in HDFS you must create new files every time: set up an incremental process that builds a "delta" file, then merge OLD + DELTA = NEW_DATA.
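
The OLD + DELTA = NEW_DATA merge can be sketched with plain Java collections, keyed by a record id, with delta records overwriting old ones. A real Spark job would express the same thing as a union plus dropDuplicates (or a full outer join) on the key column; the id keys and JSON payloads below are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DeltaMerge {

    // Merge OLD and DELTA record sets keyed by id: delta wins on conflicts.
    public static Map<String, String> merge(Map<String, String> old, Map<String, String> delta) {
        Map<String, String> merged = new LinkedHashMap<>(old);
        merged.putAll(delta); // ids present in both: delta overwrites old
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> old = new LinkedHashMap<>();
        old.put("1", "{\"date\":\"2017-01-01\"}");
        old.put("2", "{\"date\":\"2017-01-02\"}");

        Map<String, String> delta = new LinkedHashMap<>();
        delta.put("2", "{\"date\":\"2017-02-02\"}"); // updated record
        delta.put("3", "{\"date\":\"2017-01-03\"}"); // new record

        // 3 records total; id "2" carries the delta version, not a duplicate.
        System.out.println(merge(old, delta));
    }
}
```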

Upvotes: 0

ImDarrenG

Reputation: 2345

Parquet is a file format rather than a database. In order to achieve an update by id, you will need to read the file, update the value in memory, then re-write the data to a new file (or overwrite the existing file).
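
That read / update-in-memory / re-write cycle can be sketched with a toy "id,value" text file standing in for a Parquet file (in Spark you would read the whole Parquet dataset, apply the changes, and write to a new path). The file layout and method names here are invented for illustration:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RewriteById {

    // Read every record, replace (or add) the one with the given id, write all back.
    public static void upsert(Path file, String id, String value) {
        try {
            Map<String, String> records = new LinkedHashMap<>();
            if (Files.exists(file)) {
                for (String line : Files.readAllLines(file)) {   // 1. read the whole file
                    String[] parts = line.split(",", 2);
                    records.put(parts[0], parts[1]);
                }
            }
            records.put(id, value);                              // 2. update in memory
            StringBuilder out = new StringBuilder();
            records.forEach((k, v) -> out.append(k).append(',').append(v).append('\n'));
            Files.writeString(file, out.toString());             // 3. re-write the file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo: the second upsert for id "1" updates the record instead of duplicating it.
    public static List<String> demo() {
        try {
            Path file = Files.createTempFile("records", ".csv");
            upsert(file, "1", "a");
            upsert(file, "1", "b");
            upsert(file, "2", "c");
            return Files.readAllLines(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Note the cost: every update rewrites the whole file, which is why this pattern is usually batched (the OLD + DELTA merge above) rather than done per record.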

You might be better served by a database if this is a use case that will occur frequently.

Upvotes: 1
