Chique_Code

Reputation: 1530

How to append multiple parquet files to one dataframe in Pandas

I am working on decompressing snappy.parquet files with Spark and Pandas. I have 180 files (about 7 GB of data) in my Jupyter notebook. As I understand it, I need to create a loop that grabs all the files, decompresses them with Spark, and appends them to a Pandas dataframe. Here is the code:

import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read one snappy-compressed parquet file into a Spark dataframe
parquetFile = spark.read.parquet("file_name.snappy.parquet")

# Register a temp view so the file can be queried with SQL
parquetFile.createOrReplaceTempView("parquetFile")
file_output = spark.sql("SELECT * FROM parquetFile")
file_output.show()

# Convert the Spark dataframe to Pandas
pandas_df = file_output.select("*").toPandas()

This part works and I get my Pandas dataframe from one file, but I still have another 180 files that I need to append to pandas_df.
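For what it is worth, the loop I have in mind would look something like this (a sketch; the directory path and file pattern are placeholders):

import glob

import pandas as pd

dfs = []
for path in glob.glob("your_dir_path/*.snappy.parquet"):  # placeholder path
    # Read each compressed file with Spark, then convert it to Pandas
    dfs.append(spark.read.parquet(path).toPandas())

# Stack the per-file dataframes into one
pandas_df = pd.concat(dfs, ignore_index=True)

Can anyone help me out? Thank you!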

Upvotes: 1

Views: 5532

Answers (1)

Cesar A. Mostacero

Reputation: 770

With Spark you can load a dataframe from a single file or from multiple files; you only need to replace the path to your single file with the path to your folder (assuming all 180 of your files are in the same directory):

parquetFile = spark.read.parquet("your_dir_path/")
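If you then need a single Pandas dataframe, you can convert the combined result directly. A sketch, assuming the SparkSession from the question and that the full dataset fits in driver memory:

# Spark reads every parquet file in the folder into one dataframe
parquetFile = spark.read.parquet("your_dir_path/")

# One conversion replaces the per-file appends entirely
pandas_df = parquetFile.toPandas()

Note that toPandas() collects all the data to the driver, so with roughly 7 GB of parquet you may need to raise the driver memory (for example via the spark.driver.memory setting).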

Upvotes: 2
