Reputation: 41
I have a csv that contains datetime columns and I want to use Pandera to validate the columns and parse them to the correct format. An example value in the column would be: 2023-02-04T00:39:00+00:00.
This is currently parsed in pandas to the right format using the following python code:
column = pd.to_datetime(column, format="%Y-%m-%dT%H:%M:%S%z")  # %z parses the +00:00 offset
column = column.dt.tz_convert("Europe/Amsterdam")
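For reference, here is a minimal self-contained sketch of that pandas-only step on in-memory data. The sample values are made up for illustration, and the `%z` directive and `utc=True` are assumptions added here: `%z` consumes the `+00:00` offset (recent pandas versions reject a format string that leaves part of the value unparsed), and `utc=True` normalizes everything to UTC before converting to the target zone.

```python
import pandas as pd

# Hypothetical sample values in the same shape as the CSV column
column = pd.Series(["2023-02-04T00:39:00+00:00", "2023-07-01T12:00:00+00:00"])

# Parse the ISO 8601 strings; %z handles the UTC offset, utc=True yields a tz-aware UTC column
parsed = pd.to_datetime(column, format="%Y-%m-%dT%H:%M:%S%z", utc=True)

# Convert from UTC to the desired local time zone
local = parsed.dt.tz_convert("Europe/Amsterdam")

print(local)
```

The first value falls in winter (UTC+1) and the second in summer (UTC+2), so the converted column shows both offsets.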
I would like to define a pandera DataFrame schema such that parsing is handled "automatically" when I read the csv with the following code:
schema = DataFrameSchema(
    {
        "datetime_column": Column()  # how to implement the above here??
    },
    strict=True,
    coerce=False,
)
df = pd.read_csv(src, dtype={col: str(dtype) for col, dtype in schema.dtypes.items()})
schema.validate(df)
I already use the above approach for simple types like string, ints, etc. But how would I do this for DateTime types (usually tz-aware)?
The online documentation is sparse on this topic, so I haven't been able to figure it out so far.
Upvotes: 4
Views: 5048
Reputation: 21
Pandera allows you to create custom data types to include in the schema. To coerce a column to a datetime type with a specific format, you can use pandas_engine.DateTime: read all the columns in the .csv as strings, then let the schema coerce them. Here's an example:
import pandas as pd
import pandera as pa
from pandera.engines import pandas_engine
# Read every column as a string so that pandera does all the type coercion
df = pd.read_csv(r"test.csv", dtype=str)
print(df.dtypes)  # all columns have object dtype
schema = pa.DataFrameSchema(
    {
        "date_col": pa.Column(
            pandas_engine.DateTime(
                # %z consumes the "+00:00" offset in values like 2023-02-04T00:39:00+00:00
                to_datetime_kwargs={"format": "%Y-%m-%dT%H:%M:%S%z"},
                tz="Europe/Amsterdam",
            )
        ),
        "col_1": pa.Column("object"),
        "col_2": pa.Column("int32"),
    },
    coerce=True,
)
# Validate and coerce the data types
new_df = schema.validate(df)
print(new_df.dtypes)  # new dtypes for all columns
Upvotes: 2