Reputation: 1530
I am working on decompressing snappy.parquet files with Spark and Pandas. I have 180 files (7 GB of data) in my Jupyter notebook. My understanding is that I need to create a loop to grab all the files, decompress them with Spark, and append them to a Pandas table? Here is the code:
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a single Snappy-compressed Parquet file (Spark handles the decompression)
parquetFile = spark.read.parquet("file_name.snappy.parquet")
parquetFile.createOrReplaceTempView("parquetFile")
file_output = spark.sql("SELECT * FROM parquetFile")
file_output.show()

# Convert the Spark DataFrame to a Pandas DataFrame
pandas_df = file_output.toPandas()
This part works and I have my Pandas dataframe from one file, but I have another 180 files that I need to append to pandas_df. Can anyone help me out? Thank you!
Upvotes: 1
Views: 5532
Reputation: 770
With Spark you can load a dataframe from a single file or from multiple files; you only need to replace the path of your single file with the path of your folder (assuming that all 180 of your files are in the same directory).
parquetFile = spark.read.parquet("your_dir_path/")
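As a minimal sketch of the full flow (using the placeholder folder name your_dir_path/ for wherever your 180 files actually live), Spark reads every Parquet file in that folder into one dataframe, which you can then convert to Pandas exactly as you already do:

import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every .snappy.parquet file in the folder into a single Spark DataFrame
# ("your_dir_path/" is a placeholder for the directory holding your 180 files)
parquetFile = spark.read.parquet("your_dir_path/")

# Collect the combined data into one Pandas DataFrame
# (note: all ~7 GB must fit in the notebook's driver memory for this step)
pandas_df = parquetFile.toPandas()

If memory is tight, you can select only the columns you need or filter rows on the Spark side before calling toPandas().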
Upvotes: 2