Reputation: 3635
I need to save a DataFrame in CSV or Parquet format (as a single file) and then read it back. The amount of data will not exceed 60 MB, so a single file is a reasonable solution. This simple task is giving me a lot of headache... This is what I tried:
To read the file if it exists:
df = sqlContext \
    .read.parquet("s3n://bucket/myTest.parquet") \
    .toDF("key", "value", "date", "qty")
To write the file:
df.write.parquet("s3n://bucket/myTest.parquet")
This does not work because:
1) write creates the folder myTest.parquet with Hadoop-style part files that I then cannot read back with .read.parquet("s3n://bucket/myTest.parquet"). In fact I don't mind the multiple Hadoop-style files, as long as I can later read them easily into a DataFrame. Is that possible?
2) I am always working with the same file, myTest.parquet, which I keep updating and overwriting in S3. Spark tells me that the file cannot be saved because it already exists.
So, can someone show me the right way to do this read/write loop? The file format doesn't matter to me (CSV, Parquet, or even the Hadoop-style part files), as long as I can make the read/write loop work. A sketch of what I am after is below.
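Edit: this is roughly the loop I am trying to get working (an untested sketch; I am assuming mode("overwrite") is the way to replace the existing folder, and that read.parquet can read the whole myTest.parquet directory of part files back as one DataFrame):

# untested sketch: overwrite the existing folder instead of failing because it already exists
df.write.mode("overwrite").parquet("s3n://bucket/myTest.parquet")

# point read.parquet at the folder; it reads all the part files inside as a single DataFrame
df = sqlContext.read \
    .parquet("s3n://bucket/myTest.parquet") \
    .toDF("key", "value", "date", "qty")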
Upvotes: 1
Views: 5075
Reputation: 711
You can save your DataFrame with saveAsTable("TableName") and read it back with table("TableName"). The storage location can be set via spark.sql.warehouse.dir, and you can overwrite an existing table with mode(SaveMode.Overwrite) (note that SaveMode.Ignore silently skips the write when the table already exists). You can read more in the official documentation.
In Java it would look like this:
SparkSession spark = ...
spark.conf().set("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables");
Dataset<Row> data = ...
data.write().mode(SaveMode.Overwrite).saveAsTable("TableName");
Now you can read the data back with:
spark.read().table("TableName");
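For comparison, a rough PySpark equivalent of the same write/read loop (a sketch only, assuming the same table name and warehouse location; the data DataFrame here is just a placeholder):

from pyspark.sql import SparkSession

# sketch: set the warehouse location when building the session
spark = SparkSession.builder \
    .config("spark.sql.warehouse.dir", "hdfs://localhost:9000/tables") \
    .getOrCreate()

data = spark.range(10).toDF("value")  # placeholder DataFrame for this sketch

# overwrite the table on every run instead of failing because it already exists
data.write.mode("overwrite").saveAsTable("TableName")

# read it back into a DataFrame
df = spark.read.table("TableName")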
Upvotes: 1