mz_mz

Reputation: 11

Is there a limit on the number of CSV files PySpark can read?

I am relatively new to Spark/PySpark, so any help is much appreciated.

Currently we have files being delivered to Azure Data Lake hourly into a single directory, for example:

hour1.csv hour2.csv hour3.csv

I am using Databricks to read the files in that directory with the code below:

sparkdf = spark.read.format("csv").option("recursiveFileLookup", "true").option("header", "true").schema(schema).load(file_location)

Each of the CSV files is about 5 KB, and they all have the same schema.

What I am unsure about is how scalable spark.read is. Currently we are processing about 2,000 of these small files, and I am worried that there is a limit on the number of files that can be processed. Is there a cap, say a maximum of 5,000 files, at which the code above breaks?

From what I have read online, I believe data size is not an issue with this method; Spark can read petabytes' worth of data (by comparison, our total data size is still very small). However, I have found no mention of a limit on the number of files it can process. Educate me if I am wrong.

Any explanation is very much appreciated.

Thank you.

Upvotes: 1

Views: 1279

Answers (1)

Lior Regev

Reputation: 460

The limit is your driver's memory.

When reading a directory, the driver lists it (depending on the initial size, it may parallelize the listing across the executors, but it collects the results either way). Once it has the list of files, it creates tasks for the executors to run.

With that in mind, if the list is too large to fit in the driver's memory, you will have issues.
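As a quick sanity check (not part of the original answer, and assuming file_location and schema are defined as in the question), you can ask the resulting DataFrame how many files the driver actually listed and how many read tasks Spark planned. DataFrame.inputFiles() requires a reasonably recent PySpark version:

sparkdf = (
    spark.read.format("csv")
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .schema(schema)
    .load(file_location)
)

# Number of files the driver listed while planning the read
print(len(sparkdf.inputFiles()))

# Number of read tasks (partitions) Spark will hand to the executors
print(sparkdf.rdd.getNumPartitions())

A few thousand file paths take up very little memory, so the 2,000-file case in the question should be nowhere near this limit.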

You can always increase the driver's memory to handle it, or add a preprocessing step that merges the files (GCS has gsutil compose, which can merge files without downloading them).
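Since the question is about Azure rather than GCS, here is a minimal sketch (an illustration, not the answerer's code) of one way to compact the small hourly CSVs with Spark itself on a schedule; merged_location is a hypothetical output path:

# Read the small hourly files (file_location and schema as in the question)
small_df = (
    spark.read.format("csv")
    .option("recursiveFileLookup", "true")
    .option("header", "true")
    .schema(schema)
    .load(file_location)
)

# Write them back out as a single larger file; coalesce(1) produces one output
# file, which keeps later directory listings small.
(
    small_df.coalesce(1)
    .write.mode("overwrite")
    .option("header", "true")
    .csv(merged_location)  # merged_location is a hypothetical output path
)

Note that on Databricks the driver's memory is typically controlled by the driver node type chosen for the cluster rather than by setting spark.driver.memory in notebook code.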

Upvotes: 1
