TomNorway

Reputation: 3162

How to serialize a polars dataframe type in a pydantic v2 basemodel?

I have a pydantic (v2) BaseModel that can take a polars DataFrame as one of its model fields. I wish to be able to serialize the dataframe. Preferably, I would be able to serialize AND de-serialize it, but I would be happy with just being able to serialize it.

The polars dataframe has a df.write_json() method. My thinking has been that I would take the json output from that method and read it back in via the python json library, so that it becomes a json-serializable dict. Then I would somehow attach this "encoder" to the pydantic json method. For the deserialization process, I would use the pl.read_json() method to produce a dataframe.

Unfortunately, from the pydantic documentation I can only work out how to write a custom serializer for a named field, not for a given type.
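
For contrast, the per-field version I do know how to write looks roughly like this (a sketch using pydantic v2's @field_serializer; the model name is just for illustration, and the serializer is tied to the field name, which is what I want to avoid repeating on every model):

from typing import Any
from pydantic import BaseModel, field_serializer
import polars as pl
import json

class FieldLevelFoo(BaseModel, arbitrary_types_allowed=True):
    df: pl.DataFrame = pl.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

    @field_serializer("df")
    def serialize_df(self, df: pl.DataFrame) -> dict[str, Any]:
        # Convert the dataframe to a JSON-serializable dict for dumping.
        return json.loads(df.write_json())

FieldLevelFoo().model_dump_json()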

There are some docs on serializing subclasses by introducing a __get_pydantic_core_schema__ class method, but I would prefer to avoid this approach, since I would like to be able to use the polars classes directly.

Here is an example where Foo().model_dump_json() currently fails with PydanticSerializationError: Unable to serialize unknown type: <class 'polars.dataframe.frame.DataFrame'>.

from typing import Any
from pydantic import BaseModel
import polars as pl
import json

df = pl.DataFrame({"foo":[1,2,3], "bar":[4,5,6]})
df.write_json() # this produces a json representation of my dataframe
# {"columns":[{"name":"foo","datatype":"Int64","bit_settings":"","values":[1,2,3]},{"name":"bar","datatype":"Int64","bit_settings":"","values":[4,5,6]}]}

# I could use pl.read_json() to read it back into a dataframe.
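
# A sketch of the read-back side I have in mind (assumption: pl.read_json
# accepts a file-like object, so the JSON string is wrapped in io.StringIO):
import io
round_tripped = pl.read_json(io.StringIO(df.write_json()))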

def json_serializable_dataframe(df: pl.DataFrame) -> dict[str, Any]:
    "Load serialized dataframe into a serializable dict."
    return json.loads(df.write_json())

class Foo(BaseModel, arbitrary_types_allowed=True):
    df: pl.DataFrame = pl.DataFrame({"foo":[1,2,3], "bar":[4,5,6]})


Foo().model_dump_json() # how to incorporate my json_serializable_dataframe encoder here?

Is there a way to give pydantic the ability to serialize a custom type?

Upvotes: 4

Views: 2858

Answers (1)

jqurious

Reputation: 21404

Can you use @model_serializer and manually look for DataFrames?

import polars as pl
from pydantic import BaseModel, model_serializer

class Foo(BaseModel, arbitrary_types_allowed=True):
    a: pl.DataFrame = pl.DataFrame({"foo":[1], "bar":[2]})
    b: pl.DataFrame = pl.DataFrame({"baz":[3], "omg":[4]})

    @model_serializer
    def serialize(self):
        # Replace each DataFrame attribute with its serialized LazyFrame plan
        # so the dict only contains JSON-serializable values.
        for name, obj in self.__dict__.items():
            if isinstance(obj, pl.DataFrame):
                self.__dict__[name] = obj.lazy().serialize()
        return self.__dict__

Foo().model_dump_json()
'{"a":"{\\"DataFrameScan\\":{\\"df\\":{\\"columns\\":[{\\"name\\":\\"foo\\"...

note: Polars offers frame (de-)serialization via LazyFrame.serialize and LazyFrame.deserialize.
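
For completeness, here is a rough sketch of reading the frame back (older polars versions return a JSON string from LazyFrame.serialize, newer ones default to bytes, so the buffer type is picked accordingly):

import io
import polars as pl

df = pl.DataFrame({"foo": [1], "bar": [2]})
blob = df.lazy().serialize()  # str (JSON) on older polars, bytes (binary) on newer

# Wrap the serialized plan in a matching in-memory buffer and rebuild the frame.
buf = io.StringIO(blob) if isinstance(blob, str) else io.BytesIO(blob)
restored = pl.LazyFrame.deserialize(buf).collect()
print(restored)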

Upvotes: 4
