Reputation: 326
I have written a PySpark job to load files from an S3 bucket. The bucket contains many small files, and I am reading them one at a time because I need to add a column whose value is the S3 path of the file each row came from. Because of this, the Spark job spends most of its time iterating over files one by one.
Below is the code for that:
from pyspark.sql.functions import lit

for filepathins3 in awsfilepathlist:
    data = spark.read.format("parquet").load(filepathins3) \
        .withColumn("path_s3", lit(filepathins3))
The code above takes a long time because it reads the files one by one. If I instead pass the whole list of file paths to a single read, the job finishes quickly, but with that approach I cannot add the column that holds each file's path.
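For reference, the faster bulk read I mean is roughly this (a sketch, assuming awsfilepathlist is the same Python list of S3 paths as above; spark.read.parquet accepts multiple paths):

# Reads all files in one job, but gives no obvious way to tag
# each row with the file it came from.
data = spark.read.parquet(*awsfilepathlist)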
Is there a way to solve this within the PySpark job itself, rather than writing a separate program to read the files, combine them, and then load them into Spark?
Upvotes: 0
Views: 1133
Reputation: 767
If the goal is to get the file path, Spark already has a function for that, input_file_name():
from pyspark.sql.functions import input_file_name

data = spark.read.parquet('s3path').withColumn("input_file", input_file_name())
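Applied to the multi-file case in the question, a minimal sketch (assuming awsfilepathlist is the same list of S3 paths from the question) would be:

from pyspark.sql.functions import input_file_name

# Read all the small files in a single job and record, per row,
# the S3 path of the file it was read from.
data = (spark.read.parquet(*awsfilepathlist)
        .withColumn("path_s3", input_file_name()))

This produces the same path_s3 column the loop was building, without reading the files one by one.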
Upvotes: 2