matthiasdenu

Reputation: 397

Which file formats can I save a pyspark dataframe as?

I would like to save a huge pyspark dataframe as a Hive table. How can I do this efficiently? I am looking to use saveAsTable(name, format=None, mode=None, partitionBy=None, **options) from pyspark.sql.DataFrameWriter.

# Let's say I have my dataframe, my_df
# Am I able to do the following?
my_df.saveAsTable('my_table')

My question is: which formats are available for me to use, and where can I find this information for myself? Is OrcSerDe an option? I am still learning about this. Thank you.

Upvotes: 5

Views: 10827

Answers (2)

Gaurang Shah

Reputation: 12900

The following file formats are supported:

  • text
  • csv
  • jdbc
  • json
  • parquet
  • orc

Reference: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
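
For example, assuming my_df is an existing DataFrame ('my_table' is just a placeholder name here), you can pass any of these formats through the format parameter of saveAsTable, or through the builder-style format() call:

# A minimal sketch; 'my_table' is a hypothetical table name.
my_df.write.saveAsTable('my_table', format='parquet', mode='overwrite')

# Equivalent builder-style call:
my_df.write.format('orc').mode('overwrite').saveAsTable('my_table')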

Upvotes: 4

matthiasdenu

Reputation: 397

I was able to write the pyspark dataframe to a compressed Hive table using a pyspark.sql.DataFrameWriter. To do this, I did something like the following:

my_df.write.orc('my_file_path')

That did the trick.

https://spark.apache.org/docs/1.6.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.write

I am using pyspark 1.6.0, by the way.
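
For context, here is a fuller sketch of what that looks like on 1.6. The SparkContext/HiveContext setup, the source table, and the output path are assumptions for illustration, not from my actual job; ORC output is zlib-compressed by default, which is where the compression comes from:

# A minimal sketch for PySpark 1.6; names and paths are hypothetical.
# In 1.6, the ORC data source requires a HiveContext.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='orc_writer')
sqlContext = HiveContext(sc)

my_df = sqlContext.table('some_source_table')  # hypothetical source

# Write the dataframe out as ORC files (zlib-compressed by default):
my_df.write.mode('overwrite').orc('my_file_path')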

Upvotes: 1
