Daniel Mahler

Reputation: 8193

Incrementally add data to Parquet tables in S3

I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, which only seems to offer operations for reading and writing whole tables via `SQLContext.parquetFile()` and `SQLContext.saveAsParquetFile()`.

Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?

I know I can create separate tables for the updates and then form the union of the corresponding DataFrames in Spark at query time, but I have my doubts about the scalability of that.
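For concreteness, the union-at-query-time approach I have in mind looks roughly like this (bucket and folder names are made up):

```scala
// Rough sketch only; paths are placeholders for wherever the base table
// and the update batches actually live.
val base    = sqlContext.parquetFile("s3n://my-bucket/logs/base/")
val updates = sqlContext.parquetFile("s3n://my-bucket/logs/updates/")

// unionAll requires both DataFrames to have the same schema.
val all = base.unionAll(updates)
all.registerTempTable("logs")
```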

I can use something other than Spark if needed.

Upvotes: 3

Views: 8061

Answers (2)

TomTom101

Reputation: 6892

The way to append to an existing Parquet table is to use `SaveMode.Append`:

`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
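A minimal sketch of how this fits into a batch job, assuming the Spark 1.4+ `DataFrame` writer API; the bucket name, input path, and `sc` (an existing `SparkContext`) are placeholders:

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

val sqlContext = new SQLContext(sc)

// Suppose newLogs holds the latest batch of log records (placeholder input path).
val newLogs = sqlContext.read.json("s3n://my-bucket/raw-logs/2015-07-01/")

// Append writes new part files alongside the existing ones
// instead of rewriting the whole table.
newLogs.write
  .mode(SaveMode.Append)
  .parquet("s3n://my-bucket/logs-parquet/")
```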

Upvotes: 4

yjshen

Reputation: 6693

You don't need to union DataFrames after creating them separately; just supply all the paths relevant to your query to `parquetFile(paths)` and get one DataFrame, as the signature for reading Parquet files, `sqlContext.parquetFile(paths: String*)`, suggests.

Under the hood, in `newParquetRelation2`, all the `.parquet` files from all the folders you supply, as well as all the `_common_metadata` and `_metadata` files, are collected into a single list and treated equally.
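For example, a sketch of reading several separately written batches as a single table (folder names are placeholders):

```scala
// parquetFile accepts varargs, so each update batch can live in its own
// folder and still be queried as one DataFrame.
val logs = sqlContext.parquetFile(
  "s3n://my-bucket/logs/batch-001/",
  "s3n://my-bucket/logs/batch-002/",
  "s3n://my-bucket/logs/batch-003/")

logs.registerTempTable("logs")
sqlContext.sql("SELECT COUNT(*) FROM logs WHERE status = 500").show()
```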

Upvotes: 2
