Reputation: 8193
I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, and that only seems to offer operations to read and write whole tables via `SQLContext.parquetFile()` and `SQLContext.saveAsParquetFile()`.
Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?
I know I can create separate tables for the updates, and in Spark I can form the union of the corresponding DataFrames at query time, but I have my doubts about the scalability of that.
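For concreteness, the union approach looks roughly like this (the paths are just illustrative):

```scala
// Read the base table and an update batch separately,
// then union them at query time.
val base    = sqlContext.parquetFile("s3n://my-bucket/logs/base")
val updates = sqlContext.parquetFile("s3n://my-bucket/logs/updates")
val all     = base.unionAll(updates)
```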
I can use something other than Spark if needed.
Upvotes: 3
Views: 8061
Reputation: 6892
The way to append to an existing Parquet table is to use `SaveMode.Append` when writing:
`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`
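A slightly fuller sketch, assuming an existing `SQLContext`, a DataFrame `newLogs` of new log rows, and a placeholder S3 path:

```scala
import org.apache.spark.sql.SaveMode

// Append adds new part files alongside the existing ones instead of
// rewriting the whole table. Path and DataFrame name are placeholders.
newLogs.write.mode(SaveMode.Append).parquet("s3n://my-bucket/logs/parquet")
```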
Upvotes: 4
Reputation: 6693
You don't need to union DataFrames after creating them separately: just supply all the paths relevant to your query to `parquetFile(paths)` and get one DataFrame, as the signature for reading Parquet files, `sqlContext.parquetFile(paths: String*)`, suggests.
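For example, a minimal sketch (the S3 paths are placeholders) that reads the base table and an update batch as a single DataFrame:

```scala
// All of the supplied folders are read as one logical table.
val logs = sqlContext.parquetFile(
  "s3n://my-bucket/logs/base",
  "s3n://my-bucket/logs/2015-06-01"
)
logs.registerTempTable("logs")  // query everything as one table
```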
Under the hood, in `newParquetRelation2`, all the `.parquet` files from all the folders you supply, as well as all the `_common_metadata` and `_metadata` files, are merged into a single list and treated equally.
Upvotes: 2