Murali Varma
Murali Varma

Reputation: 373

Dynamically alter range of Athena Partition Projection

I am trying to understand how exactly to use Athena Partition Projection to work with the latest rolling dynamic time period of, say, 3 months. As I understand, I need to specify the static date range at the time of table creation. How would I update this date range as time goes on? Do I need to alter the table to keep this up to date?

Example: Data stored as s3://bucket-foo/path-bar/dt=2020-01-01/client=acme with 2 partitions (dt and client) By April 2020, I do not particularly care for January 2020 data.

Is this different from traditional Athena Partitions where I would just alter the table and then drop the old/obsolete partitions? I wonder if I have some of these concepts confused.

Upvotes: 1

Views: 1138

Answers (1)

Murali Varma
Murali Varma

Reputation: 373

It was pretty straightforward after reading the documentation a bunch of times.

Dynamic start and end for date types: "projection.dt.range": "NOW-90DAYS,NOW"

As I always fetch data per individual client (and never a set/range of clients), I could do: "projection.client.type": "injected"

This CloudFormation template worked for me:

FooTable:
    Type: AWS::Glue::Table
    DependsOn: FooDB
    Properties:
      DatabaseName: !Ref FooDB
      CatalogId: !Ref AWS::AccountId
      TableInput:
        Name: FooTable
        Parameters: {
          "classification" : "json",
          "projection.enabled": true,
          "projection.client.type": "injected",
          "projection.dt.type": "date",
          "projection.dt.format": "yyyy-MM-dd",
          "projection.dt.range": "NOW-90DAYS,NOW",
          "projection.dt.interval.unit": "DAYS",
          "projection.dt.interval": 1,
          "storage.location.template": "s3://BUCKET_NAME/clients/client=${client}/dt=${dt}"
        }
        TableType: EXTERNAL_TABLE
        PartitionKeys:
          - Name: client
            Type: string
          - Name: dt
            Type: string
        StorageDescriptor:
          Location: s3://BUCKET_NAME/clients/
          InputFormat: org.apache.hadoop.mapred.TextInputFormat
          OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          SerdeInfo:
            SerializationLibrary: org.openx.data.jsonserde.JsonSerDe
            Parameters: { "serialization.format": 1 }
          Columns:
            - Name: ...
              Type: string
            - Name: ...
              Type: string

Upvotes: 2

Related Questions