Reputation: 1530
I am working on decompressing snappy.parquet files with Spark and Pandas. I have 180 files (7 GB of data) in my Jupyter notebook. My understanding is that I need to create a loop to grab all the files, decompress them with Spark, and append them to a Pandas table? Here is the code:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a single Snappy-compressed Parquet file (Spark handles the decompression)
parquetFile = spark.read.parquet("file_name.snappy.parquet")
parquetFile.createOrReplaceTempView("parquetFile")
file_output = spark.sql("SELECT * FROM parquetFile")
file_output.show()

# Convert the Spark DataFrame to a Pandas DataFrame
pandas_df = file_output.toPandas()
This part works and I have my Pandas dataframe from one file, but I have another 180 files that I need to append to pandas_df. Can anyone help me out? Thank you!
Upvotes: 1
Views: 5532
Reputation: 770
With Spark you can load a dataframe from a single file or from multiple files; you only need to replace the path of your single file with the path of your folder (assuming that all 180 of your files are in the same directory).
parquetFile = spark.read.parquet("your_dir_path/")
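As a minimal sketch of the full flow (using the placeholder folder name your_dir_path/ for wherever your 180 files actually live), Spark reads every Parquet file in that folder into one dataframe, which you can then convert to Pandas exactly as you already do:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every .snappy.parquet file in the folder into a single Spark DataFrame
# ("your_dir_path/" is a placeholder for the directory holding your 180 files)
parquetFile = spark.read.parquet("your_dir_path/")

# Collect the combined data into one Pandas DataFrame
# (note: all ~7 GB must fit in the notebook's driver memory for this step)
pandas_df = parquetFile.toPandas()

If memory is tight, you can select only the columns you need or filter rows on the Spark side before calling toPandas().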
Upvotes: 2