Reputation: 3635
I need to save a DataFrame in CSV or Parquet format (as a single file) and then read it back. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headache... This is what I tried:
To read the file if it exists:
df = sqlContext \
    .read.parquet("s3n://bucket/myTest.parquet") \
    .toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates the folder myTest.parquet with Hadoop-style part files that I then cannot read back with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't mind the multiple Hadoop-style files, as long as I can later read them easily into a DataFrame. Is that possible?
2) I am always working with the same file, myTest.parquet, which I keep updating and overwriting in S3. Spark tells me that the file cannot be saved because it already exists.
So, can someone show me the right way to do this read/write loop? The file format doesn't matter to me (CSV, Parquet, or even the Hadoop-style part files), as long as I can make the read/write loop work. A sketch of what I am after is below.
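Edit: this is roughly the loop I am trying to get working (an untested sketch; I am assuming mode("overwrite") is the way to replace the existing folder, and that read.parquet can read the whole myTest.parquet directory of part files back as one DataFrame):

# untested sketch: overwrite the existing folder instead of failing because it already exists
df.write.mode("overwrite").parquet("s3n://bucket/myTest.parquet")

# point read.parquet at the folder; it reads all the part files inside as a single DataFrame
df = sqlContext.read \
    .parquet("s3n://bucket/myTest.parquet") \
    .toDF("key", "value", "date", "qty")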
Upvotes: 1
Views: 5075
Reputation: 711
You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The storage location can be set via spark.sql.warehouse.dir, and you can overwrite an existing table with mode(SaveMode.Overwrite) (note that SaveMode.Ignore silently skips the write when the table already exists). You can read more in the official documentation.
In Java it would look like this:
SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");
Now you can read the data back with:
spark.read().table("TableName");
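For comparison, a rough PySpark equivalent of the same write/read loop (a sketch only, assuming the same table name and warehouse location; the data DataFrame here is just a placeholder):

from pyspark.sql import SparkSession

# sketch: set the warehouse location when building the session
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables") \
    .getOrCreate()

data = spark.range(10).toDF("value")  # placeholder DataFrame for this sketch

# overwrite the table on every run instead of failing because it already exists
data.write.mode("overwrite").saveAsTable("TableName")

# read it back into a DataFrame
df = spark.read.table("TableName")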
Upvotes: 1