Reputation: 335
I am trying to split a huge XML file into small XML files using pyspark, and I need the data to be written into folders alphabetically.
For example, if the name starts with a, the record should be written to s3://bucket_name/a. If there is no name that starts with b, a folder named b should still be created in the same bucket, i.e. s3://bucket_name/b.
So far the code I have is:
from pyspark.sql.functions import col, lower, trim

characters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m",
              "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"]

for c in characters:
    # Keep only the rows whose name starts with the current letter.
    df1 = df.filter(lower(trim(col('name')).substr(1, 1)) == c)
    df1.write \
        .format("com.databricks.spark.xml") \
        .option("maxRecordsPerFile", 800) \
        .option("rootTag", "source") \
        .option("rowTag", "employees") \
        .mode("overwrite") \
        .save(f's3://split-files/{c}')
But this code takes a very long time to finish. Is there a better way to do this using data frames?
Thanks in advance.
Upvotes: 2
Views: 1061
Reputation: 527
The fastest way I see is to write with a partitionBy clause and process the whole data in a single go. The only drawback is that the folder name will be s3://bucket_name/char_name=a instead of the s3://bucket_name/a you are expecting; you could rename the folders afterwards if you really want to stick to that naming.
df = df.withColumn('char_name', df['name'].substr(1, 1))
df.repartition("char_name").write \
    .format("com.databricks.spark.xml") \
    .option("maxRecordsPerFile", 800) \
    .option("rootTag", "source") \
    .option("rowTag", "employees") \
    .partitionBy("char_name") \
    .mode("overwrite") \
    .save('s3://split-files/')
There is no need to do a for loop.
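If you do want to turn char_name=a into plain a afterwards, one option is a small post-processing step outside Spark. A minimal sketch with boto3 (assuming the bucket is named split-files and the written objects have keys like char_name=a/part-0000.xml; S3 has no real folders, so a "rename" is a copy plus delete):
import boto3

s3 = boto3.client("s3")
bucket = "split-files"

for c in "abcdefghijklmnopqrstuvwxyz":
    old_prefix = f"char_name={c}/"
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=old_prefix):
        for obj in page.get("Contents", []):
            old_key = obj["Key"]
            new_key = old_key.replace(old_prefix, f"{c}/", 1)
            # Copy the object under the new prefix, then delete the original.
            s3.copy_object(Bucket=bucket,
                           CopySource={"Bucket": bucket, "Key": old_key},
                           Key=new_key)
            s3.delete_object(Bucket=bucket, Key=old_key)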
If there is an absolute need for every folder to be present, you can do a left outer join against the alphabet list to create records for all letters. I added an extra column, but you can drop or rename it as per your need.
import string
from pyspark.sql.types import StringType

alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string)
df_list = spark.createDataFrame(alphabet_list, StringType())

df = df.withColumn('char_name', df['name'].substr(1, 1))
df.createOrReplaceTempView("data")
df_list.createOrReplaceTempView("missingalphabet")

df_final = spark.sql("select A.value as all_char_name, B.* from missingalphabet A left outer join data B on A.value = B.char_name")

df_final.repartition("all_char_name").write \
    .format("com.databricks.spark.xml") \
    .option("maxRecordsPerFile", 800) \
    .option("rootTag", "source") \
    .option("rowTag", "employees") \
    .partitionBy("all_char_name") \
    .mode("overwrite") \
    .save('s3://split-files/')
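The same left outer join can also be written with the DataFrame API instead of temp views, if you prefer; a rough equivalent of the SQL above (using the value and char_name columns from the code):
from pyspark.sql import functions as F

# Keep every letter from df_list and attach the matching rows from df
# (the data columns come back as nulls for letters that have no records).
df_final = (df_list.withColumnRenamed("value", "all_char_name")
            .join(df, F.col("all_char_name") == df["char_name"], "left_outer"))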
Upvotes: 1
Reputation: 5480
To reduce the time, use df.persist() before the for loop, as suggested by @Steven.
For the small-files issue you can use coalesce, but that is an expensive operation.
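Concretely, the caching step looks like this (a sketch; df is the same DataFrame as in the question), placed immediately before the loop:
# Cache df once so each of the 26 filter-and-write passes below reuses the
# parsed data instead of re-reading the source XML on every iteration.
df.persist()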
for c in characters:
    df1 = df.filter(lower(trim(col('name')).substr(1, 1)) == c)
    df1.coalesce(1).write \
        .format("com.databricks.spark.xml") \
        .option("maxRecordsPerFile", 800) \
        .option("rootTag", "source") \
        .option("rowTag", "employees") \
        .mode("overwrite") \
        .save(f's3://split-files/{c}')
This will create only one file in each folder. You can change the number of output files per folder by passing a different value to coalesce.
Upvotes: 1