Rahul

Reputation: 181

How to read a large parquet file as multiple dataframes?

I am trying to convert a large Parquet file into CSV. Since my RAM is only 8 GB, I get a memory error. Is there any way to read the Parquet file into multiple dataframes over a loop?

Upvotes: 4

Views: 6170

Answers (2)

joris

Reputation: 139162

You could do this with dask (https://dask.org/), which can work with larger-than-memory data on your local machine.
Example code to read a Parquet file and save it again as CSV:

import dask.dataframe as dd

# Lazily read the Parquet file as a Dask DataFrame (data is processed in partitions)
df = dd.read_parquet('path/to/file.parquet')

# Write each partition to its own CSV file; '*' is replaced by the partition number
df.to_csv('path/to/new_files-*.csv')

This will create a collection of CSV files (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single CSV file, see this answer: Writing Dask partitions into a single file (e.g. by concatenating them afterwards).
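As a minimal sketch, assuming a dask version whose to_csv supports the single_file flag, you could also write everything into one file directly:

import dask.dataframe as dd

# Lazily read the Parquet file; nothing is loaded into memory yet
df = dd.read_parquet('path/to/file.parquet')

# single_file=True streams the partitions sequentially into one CSV file
# (requires a dask version that supports this flag; the path is illustrative)
df.to_csv('path/to/output.csv', single_file=True)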

Upvotes: 1

Prathik Kini

Reputation: 1698

from pyspark.sql import SparkSession

# Initialise a local SparkSession with increased memory and cores
spark = SparkSession.builder.master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '4gb') \
    .config('spark.cores.max', '6') \
    .getOrCreate()

# Read the Parquet file into a Spark DataFrame
# (spark.read replaces the older SQLContext API)
df = spark.read.parquet('ParquetFile.parquet')

I have increased the memory and cores here. Please try the same, and later you can convert the DataFrame to CSV.
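A minimal sketch of that conversion step (the output path and options are illustrative; Spark writes one CSV file per partition into the given directory):

# Write the DataFrame out as CSV files, one per partition
df.write.csv('path/to/csv_output', header=True, mode='overwrite')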

Upvotes: 1
