nsc060

Reputation: 447

PySpark - partitionBy to S3 handle special character

I have a column called target_col_a in my dataframe with a Timestamp value that has been cast to String, e.g. 2020-05-27 08:00:00.

I then partitionBy on this column as shown below.

target_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')

However, my S3 path comes out as s3://my-bucket/my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet

Is there a way to output the partition without the %3A and retain :?

Note: when I use Glue's native DynamicFrame to write to S3, or Redshift UNLOAD to S3, the partitioning comes out as desired (with : rather than %3A), e.g.

glueContext.write_dynamic_frame.from_options(
    frame = target_dataset,
    connection_type = "s3",
    connection_options = {
        "path": "s3://my-bucket/my-path/",
        "partitionKeys": ["target_col_a"]},
    format = "parquet",
    transformation_ctx = "datasink2"
)

Upvotes: 4

Views: 1622

Answers (2)

QuickSilver

Reputation: 4045

Special characters like spaces and : cannot be part of an S3 URI. Even if you somehow manage to create one, you will face difficulties every time you use it.

It is better to replace these characters with URI-acceptable ones; a sketch of one approach follows the list below.

You should follow the key naming convention described in the Object Key Guidelines paragraph of the Amazon S3 documentation.

The following character sets are generally safe for use in key names:

Alphanumeric characters [0-9a-zA-Z]

Special characters !, -, _, ., *, ', (, and )
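
For example, here is a minimal sketch of that replacement in PySpark (the column and path names are taken from the question; the choice of - as the replacement character is illustrative):

from pyspark.sql import functions as F

# Rewrite the partition value so it contains only URI-safe characters,
# e.g. 2020-05-27 08:00:00 -> 2020-05-27-08-00-00.
safe_dataset = target_dataset.withColumn(
    'target_col_a',
    F.regexp_replace('target_col_a', '[ :]', '-')
)

safe_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')

This writes partitions such as target_col_a=2020-05-27-08-00-00 instead of the URL-encoded form.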

Upvotes: 1

Dave

Reputation: 2049

The short answer is no, you can't.

PySpark uses the Hadoop client libraries for input and output. These libraries build paths using Java's URI class. Spaces and colons are not valid URI characters, so they are URL-encoded before writing. PySpark handles the decoding automatically when the dataset is read back, but if you want to access the data outside of Spark or Hadoop, you'll need to URL-decode the partition values yourself.
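
For example, a minimal sketch of that decoding step in plain Python (the key below is the example path from the question):

from urllib.parse import unquote

# Extract the encoded partition value from the S3 key Spark wrote,
# then URL-decode it: '08%3A00%3A00' -> '08:00:00'.
key = 'my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet'
encoded_value = key.split('target_col_a=')[1].split('/')[0]
print(unquote(encoded_value))  # 2020-05-27 08:00:00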

Upvotes: 1
