Reputation: 1
I would like to create a DataFrame from many small files located in the same directory. I plan to use read.csv from pyspark.sql. I've learned that in the RDD world, the textFile function is designed for reading a small number of large files, whereas the wholeTextFiles function is designed for reading a large number of small files (e.g. see this thread). Does read.csv use textFile or wholeTextFiles under the hood?
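To make the distinction concrete, here is a minimal sketch of the two RDD APIs I'm referring to (the directory path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# textFile: one record per line; partitioning follows the underlying file splits
lines = sc.textFile('/path/to/dir')        # RDD[str]

# wholeTextFiles: one record per file, as (filename, content) pairs
files = sc.wholeTextFiles('/path/to/dir')  # RDD[(str, str)]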
Upvotes: 0
Views: 720
Reputation: 41987
Yes, that's possible; just give the path to the parent directory where the files are located:

df = spark.read.csv('/path/to/parent/directory')

And you should get all the files read into one DataFrame. If the files don't have the same number of fields per line, the number of columns is taken from the file that has the maximum number of fields in a line.
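A slightly fuller sketch, assuming the files share a header row (the path is a placeholder; header and inferSchema are standard spark.read.csv options):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every CSV file under the parent directory into a single DataFrame
df = spark.read.csv(
    '/path/to/parent/directory',  # placeholder path
    header=True,                  # treat the first line of each file as a header
    inferSchema=True,             # sample the data to guess column types
)
df.show()

Spark paths also accept glob patterns, so something like spark.read.csv('/path/to/parent/directory/*.csv') should work if the directory contains non-CSV files as well.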
Upvotes: 1