nsc060

Reputation: 447

PySpark - partitionBy to S3 handle special character

I have a column called target_col_a in my dataframe with a Timestamp value that has been cast to String, e.g. 2020-05-27 08:00:00.

I then partitionBy on this column as shown below.

target_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')

However, my S3 path comes out as s3://my-bucket/my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet

Is there a way to output the partition without the %3A and retain :?

Note: when I use Glue's native DynamicFrame to write to S3, or Redshift UNLOAD to S3, the partitioning comes out as desired (with : rather than %3A), e.g.

glueContext.write_dynamic_frame.from_options(
    frame = target_dataset,
    connection_type = "s3",
    connection_options = {
        "path": "s3://my-bucket/my-path/",
        "partitionKeys": ["target_col_a"]},
    format = "parquet",
    transformation_ctx = "datasink2"
)

Upvotes: 4

Views: 1622

Answers (2)

QuickSilver

Reputation: 4045

Special characters like spaces and : cannot be part of an S3 URI. Even if you somehow manage to create one, you will face difficulties every time you use it.

It is better to replace these characters with URI-acceptable ones; a sketch of one approach follows the list below.

You should follow the key naming convention described in the Object Key Guidelines paragraph of the Amazon S3 documentation.

The following character sets are generally safe for use in key names:

Alphanumeric characters [0-9a-zA-Z]

Special characters !, -, _, ., *, ', (, and )
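
For example, here is a minimal sketch of that replacement in PySpark (the column and path names are taken from the question; the choice of - as the replacement character is illustrative):

from pyspark.sql import functions as F

# Rewrite the partition value so it contains only URI-safe characters,
# e.g. 2020-05-27 08:00:00 -> 2020-05-27-08-00-00.
safe_dataset = target_dataset.withColumn(
    'target_col_a',
    F.regexp_replace('target_col_a', '[ :]', '-')
)

safe_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')

This writes partitions such as target_col_a=2020-05-27-08-00-00 instead of the URL-encoded form.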

Upvotes: 1

Dave

Reputation: 2049

The short answer is no, you can't.

PySpark uses the Hadoop client libraries for input and output. These libraries build paths using Java's URI class. Spaces and colons are not valid URI characters, so they are URL-encoded before writing. PySpark handles the decoding automatically when the dataset is read back, but if you want to access the data outside of Spark or Hadoop, you'll need to URL-decode the partition values yourself.
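
For example, a minimal sketch of that decoding step in plain Python (the key below is the example path from the question):

from urllib.parse import unquote

# Extract the encoded partition value from the S3 key Spark wrote,
# then URL-decode it: '08%3A00%3A00' -> '08:00:00'.
key = 'my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet'
encoded_value = key.split('target_col_a=')[1].split('/')[0]
print(unquote(encoded_value))  # 2020-05-27 08:00:00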

Upvotes: 1
