dgp

Reputation: 1

Spark: reading many files with read.csv

I would like to create a DataFrame from many small files located in the same directory. I plan to use read.csv from pyspark.sql. I've learned that in the RDD world, the textFile function is designed for reading a small number of large files, whereas the wholeTextFiles function is designed for reading a large number of small files (e.g. see this thread). Does read.csv use textFile or wholeTextFiles under the hood?
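For reference, here is a minimal sketch of the two RDD-level APIs I'm comparing (the paths are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-read-example").getOrCreate()
sc = spark.sparkContext

# textFile: one record per line; suited to a small number of large files
lines = sc.textFile('/data/large_files/')

# wholeTextFiles: one (path, content) pair per file; suited to many small files
pairs = sc.wholeTextFiles('/data/small_files/')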

Upvotes: 0

Views: 720

Answers (1)

Ramesh Maharjan

Reputation: 41987

Yes, that's possible. Just give the path to the parent directory:

df = spark.read.csv('/path/to/parent/directory')

All the files should then be read into one DataFrame. If the files don't all have the same number of fields per line, the number of columns in the DataFrame is taken from the file that has the maximum number of fields in a line.
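For example, a minimal runnable sketch (the directory path and the header/inferSchema options are assumptions for illustration, not required):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-many-csv").getOrCreate()

# Every CSV file under the directory is read into a single DataFrame
df = spark.read.csv('/path/to/parent/directory', header=True, inferSchema=True)

df.printSchema()
print(df.count())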

Upvotes: 1
