Reputation: 181
I am trying to convert a large Parquet file into CSV. Since my RAM is only 8 GB, I get a memory error. So is there any way to read the Parquet file into multiple dataframes over a loop?
Upvotes: 4
Views: 6170
Reputation: 139162
You could do this with dask (https://dask.org/), which can work with larger-than-memory data on your local machine.
Example code to read a Parquet file and save it again as CSV:
import dask.dataframe as dd

# Lazily read the Parquet file; nothing is loaded into memory yet
df = dd.read_parquet('path/to/file.parquet')

# Write one CSV file per partition; the '*' is replaced by the partition number
df.to_csv('path/to/new_files-*.csv')
This will create a collection of CSV files, one per partition (https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.to_csv).
If you need a single CSV file, see this answer to do that: Writing Dask partitions into single file (e.g. by concatenating the per-partition files afterwards, as sketched below).
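If you go the concatenation route, here is a minimal sketch, assuming the per-partition files were written with the pattern above and all share the same header row (the combined file name is illustrative):

import glob

# Lexicographic order; use a numeric sort key if there are many partitions
part_files = sorted(glob.glob('path/to/new_files-*.csv'))

with open('path/to/combined.csv', 'w') as out:
    for i, path in enumerate(part_files):
        with open(path) as f:
            header = f.readline()
            if i == 0:
                out.write(header)  # keep the header from the first file only
            for line in f:         # stream the remaining rows
                out.write(line)

This streams the partitions one at a time, so it stays within the same memory limits.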
Upvotes: 1
Reputation: 1698
from pyspark.sql import SparkSession

# Initialise a SparkSession with more executor memory and cores
spark = SparkSession.builder.master('local').appName('myAppName') \
    .config('spark.executor.memory', '4gb') \
    .config('spark.cores.max', '6') \
    .getOrCreate()
sc = spark.sparkContext

# Use SQLContext to read the Parquet file
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# Read the Parquet file into a Spark DataFrame
df = sqlContext.read.parquet('ParquetFile.parquet')
I have increased the executor memory and cores here. Please try the same, and afterwards you can write the result out as CSV.
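For the CSV step, a minimal sketch (the output directory name is illustrative); note that Spark writes a directory of part files, one per partition, rather than a single CSV:

# Write the DataFrame back out as CSV, keeping a header row in each part file
df.write.csv('ParquetFile_csv', header=True, mode='overwrite')

If a single file is required, the part files can be concatenated afterwards, as in the dask answer above.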
Upvotes: 1