Reputation: 447
I have a column called target_col_a in my dataframe containing Timestamp values that have been cast to String, e.g. 2020-05-27 08:00:00.
I then partitionBy this column as shown below.
target_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')
However, my S3 path turns out like s3://my-bucket/my-path/target_col_a=2020-05-27 08%3A00%3A00/part-0-file1.snappy.parquet
Is there a way to write the partition without the %3A and keep the : instead?
Note: when I use the Glue-native DynamicFrame to write to S3, or Redshift UNLOAD to S3, the partitioning comes out as desired (with : rather than %3A), e.g.
glueContext.write_dynamic_frame.from_options(
    frame = target_dataset,
    connection_type = "s3",
    connection_options = {
        "path": "s3://my-bucket/my-path/",
        "partitionKeys": ["target_col_a"]},
    format = "parquet",
    transformation_ctx = "datasink2"
)
Upvotes: 4
Views: 1622
Reputation: 4045
Characters like spaces and : cannot appear unencoded in an S3 URI.
Even if you somehow manage to create such a key, you will face difficulties every time you use it.
It is better to replace these characters with URI-safe ones.
You should follow the key name conventions described in the Object Key Guidelines section of the Amazon S3 documentation.
The following character sets are generally safe for use in key names:
Alphanumeric characters [0-9a-zA-Z]
Special characters !, -, _, ., *, ', (, and )
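For example, one way to apply this advice (a minimal sketch, assuming the same target_col_a string column from the question) is to rewrite the value into a URI-safe format before partitioning:

from pyspark.sql import functions as F

# Hypothetical sketch: reformat the timestamp string so the partition value
# contains only URI-safe characters (digits and hyphens).
safe_dataset = target_dataset.withColumn(
    'target_col_a',
    F.date_format(F.to_timestamp('target_col_a'), 'yyyy-MM-dd-HH-mm-ss')
)

safe_dataset \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('target_col_a') \
    .save('s3://my-bucket/my-path')

This would produce partition directories like target_col_a=2020-05-27-08-00-00, which need no URL encoding.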
Upvotes: 1
Reputation: 2049
The short answer is no, you can't.
PySpark uses the Hadoop client libraries for input and output. These libraries create paths using the Java URI class. Spaces and colons are not valid URI characters, so they are URL-encoded before writing. PySpark handles the decoding automatically when the dataset is read back, but if you want to access the datasets outside of Spark or Hadoop, you'll need to URL-decode the column values.
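For example (a minimal sketch, not from the answer itself), Python's standard library can decode an encoded partition directory name back to the original value:

from urllib.parse import unquote

# Decode a URL-encoded partition directory name as written by Spark/Hadoop.
encoded = 'target_col_a=2020-05-27 08%3A00%3A00'
print(unquote(encoded))  # target_col_a=2020-05-27 08:00:00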
Upvotes: 1