Yorian

Reputation: 41

How to define a Pandera DataFrame schema for validating and parsing datetime columns?

I have a CSV file that contains datetime columns, and I want to use Pandera to validate those columns and parse them into the correct format. An example value in the column is: 2023-02-04T00:39:00+00:00.

This is currently parsed in pandas into the right format with the following Python code:

column = pd.to_datetime(column, format="%Y-%m-%dT%H:%M:%S")
column = column.dt.tz_convert("Europe/Amsterdam")

I would like to define a Pandera DataFrame schema so that parsing is handled "automatically" when I read the CSV with the following code:

schema = DataFrameSchema(
    {
        "datetime_column": Column()  # how to implement the above here??
    },
    strict=True,
    coerce=False,
)

df = pd.read_csv(src, dtype={col: str(dtype) for col, dtype in schema.dtypes.items()})
schema.validate(df)

I already use the above approach for simple types like strings, ints, etc. But how would I do this for DateTime types (usually tz-aware)?

The online documentation on this is sparse, so I haven't been able to figure it out so far.

Upvotes: 4

Views: 5048

Answers (1)

Icaro15

Reputation: 21

Pandera lets you create custom data types to include in a schema. To coerce a column to a datetime type with a specific format, use pandas_engine.DateTime: import all the columns in the .csv as strings, then let the schema coerce them during validation. Here's an example:

import pandas as pd
import pandera as pa
from pandera.engines import pandas_engine

# Read every column as a string so the schema controls all type conversion
df = pd.read_csv(r"test.csv", dtype=str)
print(df.dtypes)  # all columns are read as strings

schema = pa.DataFrameSchema(
    {
        "date_col": pa.Column(
            pandas_engine.DateTime(
                # %z matches the UTC offset (e.g. +00:00) in the input strings
                to_datetime_kwargs={"format": "%Y-%m-%dT%H:%M:%S%z"},
                tz="Europe/Amsterdam",
            )
        ),
        "col_1": pa.Column("object"),
        "col_2": pa.Column("int32"),
    },
    coerce=True,
)

# Validate and coerce the data types
new_df = schema.validate(df)
print(new_df.dtypes)  # new dtypes for all columns
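
As a sanity check on what the coerced column should contain, the same conversion can be sketched in plain pandas, without Pandera. This is only an illustration: the sample values and the names raw, parsed and local are made up for the example, and the format string includes %z on the assumption that the UTC offset in the input should be matched explicitly (recent pandas versions are stricter about formats that do not cover the whole string).

```python
import pandas as pd

# Hypothetical sample values in the same shape as the example from the question
raw = pd.Series(["2023-02-04T00:39:00+00:00", "2023-07-01T12:00:00+00:00"])

# Parse the strings, including the UTC offset, into tz-aware UTC timestamps
parsed = pd.to_datetime(raw, format="%Y-%m-%dT%H:%M:%S%z", utc=True)

# Convert from UTC to the target time zone
local = parsed.dt.tz_convert("Europe/Amsterdam")

print(local)
```

In February, Amsterdam is at UTC+1, so 00:39 UTC becomes 01:39 local time; in July it is at UTC+2. If the Pandera schema is coercing correctly, new_df["date_col"] should hold the same kind of tz-aware values.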

Upvotes: 2
