Jay

Reputation: 326

Reading too many small files in PySpark is taking too much time

I have written a PySpark job to load files from an S3 bucket. The bucket contains a large number of small files, and I am reading them one by one in Spark because I need to add a column to each file's data that holds the S3 path the file was read from. As a result, the job spends most of its time iterating over the files one at a time.

Below is the code for that:

from pyspark.sql.functions import lit
data = None
for filepathins3 in awsfilepathlist:
    df = spark.read.format("parquet").load(filepathins3) \
             .withColumn("path_s3", lit(filepathins3))  # tag rows with their S3 path
    data = df if data is None else data.union(df)  # combine the files one at a time

The above code takes a long time because it reads the files one by one. If I pass the whole list of file paths in a single call, the Spark job finishes quickly, but with that approach I cannot add a column containing the file path to the DataFrame.

Is there a way to solve this within the PySpark job itself, rather than writing a separate program that reads the files, combines them, and then loads them into Spark?

Upvotes: 0

Views: 1133

Answers (2)

jayrythium

Reputation: 767

If the goal is to get the file path, Spark already has a function for this: input_file_name():

from pyspark.sql.functions import input_file_name

data = spark.read.parquet('s3path').withColumn("input_file", input_file_name())
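As a quick sanity check, you can list the distinct source paths that were picked up, using only the data frame and input_file column from the snippet above:

# show every distinct file path that contributed rows
data.select("input_file").distinct().show(truncate=False)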

Upvotes: 2

code.gsoni

Reputation: 695

You can simply pass the whole list of paths in one call:

spark.read.parquet(*awsfilepathlist)
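
If you also need the source path as a column, a minimal sketch combining this with the input_file_name() approach from the other answer (assuming awsfilepathlist is the same list of S3 parquet paths as in the question) would be:

from pyspark.sql.functions import input_file_name

# read all the small files in one pass and tag each row with the file it came from
data = spark.read.parquet(*awsfilepathlist) \
            .withColumn("path_s3", input_file_name())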

Upvotes: 1
