Wesley Cheek

Reputation: 1696

Decode URL strings with Pydantic

I am using Pydantic to validate and type an incoming S3 Event in an AWS Lambda function.

The event looks like this (only including relevant bits):

{
  "Records": [
    {
      "s3": {
        "bucket": {
          "name": "my-bucket"
        },
        "object": {
          "key": "MYKEY%28CSV%29/XXXX.CSV"
        }
      }
    }
  ]
}

I define my models like this to extract the relevant information:

from pydantic import BaseModel

class ObjectInfo(BaseModel):
    key: str


class BucketInfo(BaseModel):
    name: str


class S3Schema(BaseModel):
    bucket: BucketInfo
    object: ObjectInfo


class Record(BaseModel):
    s3: S3Schema


class DeletionEvent(BaseModel):
    Records: list[Record]

def handler(event: dict, _):
    eventTyped = DeletionEvent(**event)
    return True

The problem is that the key arrives URL-encoded: the correct value is MYKEY(CSV)/XXXX.CSV, not MYKEY%28CSV%29/XXXX.CSV. I usually fix this by calling urllib.parse.unquote_plus, which decodes the %XX escapes representing special characters (and turns + into a space). I think I could define a custom decoder, but that seems like overkill.
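For reference, the manual decoding described above would look roughly like this. It is a minimal standard-library sketch; the variable names raw_key and decoded_key and the "my+file.csv" sample are made up for illustration.

```python
from urllib.parse import unquote, unquote_plus

# Key exactly as it appears in the S3 event shown earlier
raw_key = "MYKEY%28CSV%29/XXXX.CSV"

# unquote_plus replaces %XX escapes and also turns "+" into a space
decoded_key = unquote_plus(raw_key)
print(decoded_key)  # MYKEY(CSV)/XXXX.CSV

# The two functions differ only in how they treat "+"
plus_kept = unquote("my+file.csv")           # "+" is left untouched
plus_as_space = unquote_plus("my+file.csv")  # "+" becomes a space
print(plus_kept, "|", plus_as_space)
```

The difference in how "+" is handled matters when a key contains spaces that were encoded as "+".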

Is there any way to get Pydantic to do this decoding for me? It has a bunch of classes for working with URLs, but I don't see anything for decoding a standalone URL-encoded string.

Upvotes: 0

Views: 1175

Answers (1)

Wesley Cheek

Reputation: 1696

I took my own advice and looked into building a custom decoder. It still feels like Pydantic should have a better way. Here is the solution I've found:

from urllib.parse import unquote
from typing_extensions import Annotated

from pydantic import (
    BaseModel,
    EncodedStr,
    EncoderProtocol
)

# This class is used to "decode" the URL-encoded string
class MyEncoder(EncoderProtocol):
    @classmethod
    def decode(cls, data: bytes) -> bytes:
        # Pydantic passes the value in as bytes. unquote accepts bytes (Python 3.9+),
        # but unquote_plus does not, so unquote is used here.
        # Limitation: a space encoded as "+" is left as "+" (%20 is decoded fine).
        # unquote returns a str, so it is encoded back to bytes for Pydantic.
        return unquote(data).encode()

MyEncodedStr = Annotated[str, EncodedStr(encoder=MyEncoder)]

class ObjectInfo(BaseModel):
    key: MyEncodedStr


class BucketInfo(BaseModel):
    name: str


class S3Schema(BaseModel):
    bucket: BucketInfo
    object: ObjectInfo


class Record(BaseModel):
    s3: S3Schema


class DeletionEvent(BaseModel):
    Records: list[Record]

event = {
  "Records": [
    {
      "s3": {
        "bucket": {
          "name": "my-bucket"
        },
        "object": {
          "key": "MYKEY%28CSV%29/XXXX.CSV"
        }
      }
    }
  ]
}

eventTyped = DeletionEvent(**event)

This properly converts the URL encoded string "MYKEY%28CSV%29/XXXX.CSV" to the normal string "MYKEY(CSV)/XXXX.CSV".
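If the key may contain spaces encoded as "+", one possible variant is to decode the bytes to a str first, so that unquote_plus can be used after all. This is only a sketch, assuming Pydantic v2. PlusAwareEncoder and PlusDecodedStr are hypothetical names, and the encode and get_json_format methods (including the "url-encoded" label) are my own additions to fill in the rest of EncoderProtocol.

```python
from typing import Annotated
from urllib.parse import quote_plus, unquote_plus

from pydantic import BaseModel, EncodedStr, EncoderProtocol


class PlusAwareEncoder(EncoderProtocol):
    """Hypothetical encoder that also turns "+" into a space."""

    @classmethod
    def decode(cls, data: bytes) -> bytes:
        # bytes -> str, percent-decode (including "+"), then back to bytes
        return unquote_plus(data.decode()).encode()

    @classmethod
    def encode(cls, value: bytes) -> bytes:
        # Reverse direction: re-apply URL encoding, keeping "/" readable
        return quote_plus(value.decode(), safe="/").encode()

    @classmethod
    def get_json_format(cls) -> str:
        # Free-form label for the JSON schema; the value is an assumption
        return "url-encoded"


PlusDecodedStr = Annotated[str, EncodedStr(encoder=PlusAwareEncoder)]


class ObjectInfo(BaseModel):
    key: PlusDecodedStr


info = ObjectInfo(key="MY+KEY%28CSV%29/XXXX.CSV")
print(info.key)  # MY KEY(CSV)/XXXX.CSV
```

Defining encode as well should keep the type usable in the other direction, where the MyEncoder above only implements decode.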

My understanding:

  1. Pydantic first converts the str to bytes behind the scenes.
  2. MyEncoder.decode is called on the bytes object.
  3. urllib.parse.unquote decodes the %XX escapes and returns a str.
  4. Because Pydantic expects decode to return a bytes object, that str is encoded back to bytes.
  5. Pydantic converts the bytes object back to a str.
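The steps above can be sketched in plain Python, without Pydantic. This only approximates what EncodedStr appears to do around the encoder; the MyEncoder here is a stand-in without the Pydantic base class, and decode_str is an illustrative helper, so the real internals may differ.

```python
from urllib.parse import unquote


class MyEncoder:
    """Stand-in for the encoder above, without EncoderProtocol."""

    @classmethod
    def decode(cls, data: bytes) -> bytes:
        # Steps 3 and 4: unquote returns a str, which is encoded back to bytes
        return unquote(data).encode()


def decode_str(raw: str) -> str:
    as_bytes = raw.encode()               # step 1: str -> bytes
    decoded = MyEncoder.decode(as_bytes)  # step 2: decode is called on the bytes
    return decoded.decode()               # step 5: bytes -> str


result = decode_str("MYKEY%28CSV%29/XXXX.CSV")
print(result)  # MYKEY(CSV)/XXXX.CSV
```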

Upvotes: 0
