Reputation: 397
I would like to save a huge PySpark DataFrame as a Hive table. How can I do this efficiently? I am looking at using pyspark.sql.DataFrameWriter.saveAsTable(name, format=None, mode=None, partitionBy=None, **options).
# Let's say I have my dataframe, my_df
# Am I able to do the following?
my_df.saveAsTable('my_table')
Which formats are available for me to use, and where can I find this information myself? Is OrcSerDe an option? I am still learning about this. Thank you.
Upvotes: 5
Views: 10827
Reputation: 12900
The following file formats are supported.
Upvotes: 4
Reputation: 397
I was able to write the PySpark DataFrame to a compressed Hive table by using pyspark.sql.DataFrameWriter, which is available as my_df.write. I had to do something like the following:
my_df.write.orc('my_file_path')
That did the trick.
https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.write
I am using PySpark 1.6.0, by the way.
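For anyone who wants the result registered as a named table rather than only written to a path, here is a minimal sketch. It assumes my_df was created from a HiveContext (in Spark 1.6, ORC support is only available together with HiveContext) and uses the format and mode arguments from the saveAsTable signature quoted in the question. The helper name save_as_hive_table and its defaults are made up for illustration, so treat it as a starting point rather than the one right way.

```python
def save_as_hive_table(df, table_name, fmt="orc", mode="overwrite"):
    """Save a DataFrame as a table through its DataFrameWriter.

    Hypothetical helper: the name and the defaults are illustrative only.
    Note that mode="overwrite" replaces an existing table of the same name;
    use "append", "ignore", or "error" if that is not what you want.
    """
    # df.write is the DataFrameWriter; format and mode are keyword
    # arguments of DataFrameWriter.saveAsTable.
    df.write.saveAsTable(table_name, format=fmt, mode=mode)


# Example usage (assumes my_df already exists, as in the question):
# save_as_hive_table(my_df, "my_table")
```

The same call can also be written with the chained form my_df.write.format("orc").mode("overwrite").saveAsTable("my_table"), which some people find easier to read.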
Upvotes: 1