mz_mz

Reputation: 11

Is there a limit on the number of CSV files PySpark can read?

I am relatively new to Spark/PySpark, so any help is much appreciated.

Currently we have files being delivered to Azure Data Lake hourly into a single directory, for example:

hour1.csv hour2.csv hour3.csv

I am using Databricks to read the files in that directory with the code below:

sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)

Each of the CSV files is about 5 KB, and they all have the same schema.

What I am unsure about is how scalable spark.read is. Currently we are processing about 2,000 of these small files, and I am worried that there is a limit on the number of files that can be processed. Is there a cap, say a maximum of 5,000 files, at which the code above breaks?

From what I have read online, I believe data size is not an issue with this method; Spark can read petabytes' worth of data (by comparison, our total data size is still very small). However, I have found no mention of a limit on the number of files it can process. Educate me if I am wrong.

Any explanation is very much appreciated.

Thank you.

Upvotes: 1

Views: 1279

Answers (1)

Lior Regev

Reputation: 460

The limit is your driver's memory.

When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing across the executors, but it collects the results either way). Once it has the list of files, it creates tasks for the executors to run.

With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
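As a quick sanity check (not part of the original answer, and assuming file_location and schema are defined as in the question), you can ask the resulting DataFrame how many files the driver actually listed and how many read tasks Spark planned. DataFrame.inputFiles() requires a reasonably recent PySpark version:

sparkdf = (
    spark.read.format("csv")
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .schema(schema)
    .load(file_location)
)

# Number of files the driver listed while planning the read
print(len(sparkdf.inputFiles()))

# Number of read tasks (partitions) Spark will hand to the executors
print(sparkdf.rdd.getNumPartitions())

A few thousand file paths take up very little memory, so the 2,000-file case in the question should be nowhere near this limit.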

You can always increase the driver's memory to handle it, or add a preprocessing step that merges the files (GCS has gsutil compose, which can merge files without downloading them).
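Since the question is about Azure rather than GCS, here is a minimal sketch (an illustration, not the answerer's code) of one way to compact the small hourly CSVs with Spark itself on a schedule; merged_location is a hypothetical output path:

# Read the small hourly files (file_location and schema as in the question)
small_df = (
    spark.read.format("csv")
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .schema(schema)
    .load(file_location)
)

# Write them back out as a single larger file; coalesce(1) produces one output
# file, which keeps later directory listings small.
(
    small_df.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv(merged_location)  # merged_location is a hypothetical output path
)

Note that on Databricks the driver's memory is typically controlled by the driver node type chosen for the cluster rather than by setting spark.driver.memory in notebook code.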

Upvotes: 1
