Reputation: 32548
Looking at the documentation of awswrangler.s3.to_csv or awswrangler.s3.to_parquet, there is a dataset parameter.
From testing, it looks like setting dataset=True allows, among other things, appending new data to an already existing set. It also looks like when dataset=True, I can't specify the file name, and AWS autogenerates the names of the files that are added to the specified path.
Apart from that, I can't find more information on what dataset means. Is it just referring to the general concept, or is there a specific meaning within the context of AWS? What exactly is dataset, and when should it be set to True?
Upvotes: 1
Views: 267
Reputation: 200910
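The dataset=True option writes the data as a dataset, meaning a collection of files under an S3 path prefix that can optionally be partitioned and registered in the Glue/Athena catalog, rather than as ordinary file(s). This lets you store metadata such as a table description and column comments along with the data.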
The dataset parameter documentation:
dataset (bool) – If True store as a dataset instead of ordinary file(s) If True, enable all follow arguments: partition_cols, mode, database, table, description, parameters, columns_comments, concurrent_partitioning, catalog_versioning, projection_enabled, projection_types, projection_ranges, projection_values, projection_intervals, projection_digits, catalog_id, schema_evolution.
Note all the extra arguments that become available when you save as a dataset. The information they carry, like columns_comments, concurrent_partitioning, projection_values, is not kept when you write ordinary CSV or Parquet file(s) with dataset=False. On the other hand, those values are probably only useful if you plan to do further manipulation of the data via awswrangler/pandas at some later date.
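As a rough sketch of how those dataset-only arguments might be used, the function below writes a DataFrame as a Parquet dataset and passes catalog metadata along with it. The function name and its parameters are hypothetical, and it assumes the Glue database you name already exists and that your AWS credentials are configured; only wr.s3.to_parquet and the keyword arguments listed in the documentation quote above come from awswrangler.

```python
def write_cataloged_dataset(df, prefix, database, table,
                            description=None, columns_comments=None):
    """Write `df` under an S3 prefix as a dataset and register it in the catalog.

    All names here are illustrative; `database` is assumed to already exist.
    """
    # Imported here so the library is only needed when a write is performed.
    import awswrangler as wr

    return wr.s3.to_parquet(
        df=df,
        path=prefix,                        # an S3 prefix, not a single file name
        dataset=True,                       # enables the arguments below
        database=database,                  # Glue/Athena catalog database
        table=table,                        # Glue/Athena catalog table
        description=description,            # table-level description
        columns_comments=columns_comments,  # e.g. {"id": "primary key"}
    )
```

With dataset=False these catalog arguments are not available, which is the trade-off described above.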
Also note that if you set dataset=True you have to give it a path prefix instead of a single file name, because the output generated will be spread across multiple files with autogenerated names.
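To make the difference in the path argument concrete, here is a minimal sketch contrasting the two modes. The helper names and the example bucket paths in the comments are made up for illustration; the calls assume configured AWS credentials and write access to the bucket.

```python
def write_single_file(df, file_path):
    """Write `df` to exactly one object; `file_path` is a full file name."""
    # Imported here so the library is only needed when a write is performed.
    import awswrangler as wr

    return wr.s3.to_parquet(df=df, path=file_path, dataset=False)


def append_to_dataset(df, prefix, partition_cols=None):
    """Add `df` to the dataset stored under `prefix`; file names are generated."""
    import awswrangler as wr

    return wr.s3.to_parquet(
        df=df,
        path=prefix,                    # a prefix such as a "folder", ending in "/"
        dataset=True,
        mode="append",                  # keep existing files and add new ones
        partition_cols=partition_cols,  # optional, e.g. ["year"]
    )


# Hypothetical usage (paths are placeholders):
# write_single_file(df, "s3://my-bucket/exports/report.parquet")
# append_to_dataset(df, "s3://my-bucket/datasets/events/", partition_cols=["year"])
```

Calling append_to_dataset repeatedly with the same prefix adds more files under it, which matches the append behaviour observed in the question.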
If you want to use the data in any other tool besides Pandas, such as loading the CSV into Excel, then you most likely want to set dataset=False and output to a single file.
Upvotes: 1