Reputation: 2745
A Databricks Spark cluster can auto-scale with load.
I am reading gzip files in Spark and repartitioning the RDD to get parallelism, because a gzip file is read on a single core and produces an RDD with a single partition.
As per this post, the ideal number of partitions is the number of cores in the cluster, which I can set when repartitioning. But with an auto-scaling cluster this number varies with the state of the cluster and how many executors it currently has.
So, what should the partitioning logic be for an auto-scaling Spark cluster?
EDIT 1:
The folder is ever-growing; gzip files keep arriving in it periodically. Each gzip file is ~10GB (~150GB uncompressed). I know that multiple files can be read in parallel. But for a single very large file, Databricks may auto-scale the cluster, and even after scaling has added cores, my dataframe will still have the smaller number of partitions chosen from the previous cluster state (when it had fewer cores).
Even though my cluster will auto-scale (scale out), the processing will be limited to the number of partitions I set with:
num_partitions = <cluster cores before scaling>
df.repartition(num_partitions)
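One common workaround (a sketch, not a Databricks-specific API) is to read the cluster's parallelism at runtime via `spark.sparkContext.defaultParallelism` instead of hard-coding a core count, and to oversubscribe by a small factor so there is still task-level headroom if the cluster scales out after the repartition. The helper below is hypothetical; `factor` and `minimum` are illustrative tuning knobs, not Spark settings:

```python
# Hypothetical helper: derive the partition count from the cluster's
# *current* parallelism rather than a hard-coded core count. In a real
# job you would pass spark.sparkContext.defaultParallelism as
# `current_cores`, e.g. df.repartition(choose_partitions(sc.defaultParallelism)).
def choose_partitions(current_cores, factor=3, minimum=8):
    """Oversubscribe by `factor` so spare tasks exist if the cluster
    scales out after the repartition has already happened."""
    return max(current_cores * factor, minimum)

# On a 16-core cluster this yields 48 partitions:
print(choose_partitions(16))  # 48
```

Oversubscribing (2-4x the core count) is a widely used heuristic; the exact factor is workload-dependent.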
Upvotes: 0
Views: 947
Reputation: 1126
A standard gzip file is not splittable, so Spark will handle the gzip file with just a single core, a single task, no matter what your settings are (as of Spark 2.4.5/3.0). Hopefully the world will move to bzip2 or other splittable compression techniques when creating large files.
If you directly write the data out to Parquet, you will end up with a single (splittable) Parquet file, written by a single core. If you are stuck with the default gzip codec, it is better to repartition after the read and write out multiple Parquet files.
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True)])

input_path = "s3a://mybucket/2G_large_csv_gzipped/onebillionrows.csv.gz"

# Allow up to ~1000 MB per partition when reading splittable sources
spark.conf.set('spark.sql.files.maxPartitionBytes', 1000 * (1024 ** 2))

df_two = spark.read.format("csv").schema(schema).load(input_path)

# Repartition after the single-core read so the write happens in parallel
df_two.repartition(32).write.format("parquet").mode("overwrite").save("dbfs:/tmp/spark_gunzip_default_remove_me")
I very recently found a splittable gzip codec, and initial tests are very promising. The codec actually reads the file multiple times: each task scans ahead by some number of bytes (without decompressing) and then starts the decompression. The benefit pays off when it comes time to write the dataframe out as Parquet: you end up with multiple files, all written in parallel, for greater throughput and shorter wall-clock time (your CPU hours will be higher).
Reference: https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md
My test case:
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

schema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", DoubleType(), True),
    StructField("c", DoubleType(), True)])

input_path = "s3a://mybucket/2G_large_csv_gzipped/onebillionrows.csv.gz"

spark.conf.set('spark.sql.files.maxPartitionBytes', 1000 * (1024 ** 2))

# The splittable codec lets Spark assign multiple tasks to a single .gz file
df_gz_codec = (spark.read
               .option('io.compression.codecs', 'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
               .schema(schema)
               .csv(input_path)
               )

# No repartition needed: the read itself is already parallel
df_gz_codec.write.format("parquet").save("dbfs:/tmp/gunzip_to_parquet_remove_me")
Upvotes: 1
Reputation: 2178
For a splittable file/data source, partitions are mostly created automatically, depending on the number of cores, whether the operation is narrow or wide, file size, etc. Partitions can also be controlled programmatically using coalesce and repartition. But for a gzip/un-splittable file there will be just one task per file, and as many can run in parallel as there are cores available (like you said).
For a dynamic cluster, one option you have is to point your job at a folder/bucket containing a large number of gzip files. Say you have 1000 files to process and 10 cores: then 10 will run in parallel. When your cluster dynamically grows to 20 cores, 20 will run in parallel. This happens automatically and you needn't code for it. The only catch is that if there are fewer files than available cores, the extra cores sit idle. This is a known deficiency of un-splittable files.
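The arithmetic above can be sketched as follows. Since each un-splittable gzip file is exactly one task, the job runs in "waves" of ceil(files / cores), and scaling out shrinks the number of waves without any code change (the function name here is illustrative):

```python
import math

# Sketch: with un-splittable gzip files, each file is one task, so the
# job completes in ceil(num_files / num_cores) waves of parallel work.
def waves(num_files, num_cores):
    return math.ceil(num_files / num_cores)

print(waves(1000, 10))  # 100 waves before scaling out
print(waves(1000, 20))  # 50 waves after scaling to 20 cores
print(waves(5, 10))     # 1 wave, but 5 of the 10 cores sit idle
```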
The other option would be to size the cluster for the job based on the number and size of the files available. You can derive an empirical formula from historical run times. Say you have 5 large files and 10 small files (half the size of a large one): then you might assign 20 cores (10 + 2*5) to use the cluster resources efficiently.
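The sizing formula in the example above weights each file by its relative size (a large file counts as two small ones), which can be written as a tiny helper. The function name and weighting are illustrative of this answer's example, not a general rule:

```python
# Illustrative sizing formula: small files count as 1 unit of work,
# large files (twice the size) count as 2, and one core is assigned
# per unit so all files finish in roughly one wave.
def suggested_cores(num_large, num_small):
    return num_small + 2 * num_large

print(suggested_cores(5, 10))  # 20, matching the example in the answer
```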
Upvotes: 1