Reputation: 8628
If I save a DataFrame this way in Java:
df.write().parquet("myTest.parquet");
then it gets saved in a hadoopish way (a folder with numerous part files).
Is it possible to save the DataFrame as a single file? I tried collect(), but it does not help.
If that's impossible, then my question is: how should I change the Python code below for reading Parquet files from the hadoopish folder created by df.write().parquet("myTest.parquet")?
load_df = sqlContext.read.parquet("myTest.parquet").where('field1="aaa"').select('field2', 'field3').coalesce(64)
Upvotes: 0
Views: 842
Reputation: 2424
Spark writes your data into a directory; the files are numerous, as you say, and if the write operation succeeds it also saves an empty marker file called _SUCCESS.
I'm coming from Scala, but I do believe that there's a similar way in Python.
Saving and reading your files in Parquet, JSON, or whatever format you want is straightforward:
df.write.parquet("path")
val loadDf = spark.read.parquet("path")
I tried collect(), but it does not help.
Talking about collect(): it is not good practice to use it in such operations, because it returns your data to the driver, so you lose the benefits of parallel computation, and it will cause an OutOfMemoryError if the data can't fit in the driver's memory.
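For illustration, a minimal sketch (PySpark assumed; df is a hypothetical DataFrame) of why collect() doesn't help here:

# collect() pulls every row back to the driver as a plain Python list of
# Row objects; the result is no longer a distributed DataFrame, and a
# large dataset can exhaust driver memory.
rows = df.collect()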
Is it possible to save the DataFrame as a single file?
You really don't need to do that in most cases; if you do, use the repartition(1) method on your DataFrame before saving it, as shown in the sketch below.
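For example, a minimal PySpark sketch (df and the output path here are hypothetical):

# repartition(1) shuffles everything into a single partition, so the
# output directory contains exactly one part file (plus _SUCCESS).
# Note it is still a directory, just with a single data file inside.
df.repartition(1).write.parquet("single.parquet")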
Hope it helps. Best regards.
Upvotes: 1
Reputation: 74619
Is it possible to save the DataFrame as a single file?
Yes, but you should not, as you may put too much pressure on a single JVM, which can lead not only to performance degradation but also to JVM termination and hence failure of the entire Spark application.
So, yes, it's possible, and you should use repartition(1) to have a single partition:
repartition(numPartitions: Int): Dataset[T] Returns a new Dataset that has exactly numPartitions partitions.
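As a quick sanity check, a sketch (PySpark assumed; df is hypothetical) confirming the partition count after the call:

df_single = df.repartition(1)
print(df_single.rdd.getNumPartitions())  # prints 1: all rows now sit in one partition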
how should I change the Python code for reading Parquet files from the hadoopish folder
Loading the dataset from what you called a "hadoopish" folder means not being concerned with the internal structure at all, and considering it a single file (that just happens to be a directory under the covers).
That's an internal representation of how the file is stored, and it does not affect the code that loads it.
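For instance, the reading code from the question works unchanged against the directory (a sketch reusing the question's names and field filters):

# The directory path is passed exactly as if it were a single file;
# Spark discovers and reads all part files inside it.
load_df = (sqlContext.read.parquet("myTest.parquet")
           .where('field1="aaa"')
           .select('field2', 'field3')
           .coalesce(64))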
Upvotes: 1