Joachim

Reputation: 3270

Type-checking Pandas DataFrames

I want to type-check Pandas DataFrames, i.e. I want to specify which column labels a DataFrame must have and which data type (dtype) is stored in each of them. A crude implementation (inspired by this question) would work like this:

import functools
from collections import namedtuple

Col = namedtuple('Col', 'label, type')

def dataframe_check(*specification):
    def check_accepts(f):
        # One spec list per positional DataFrame argument
        assert len(specification) <= f.__code__.co_argcount
        @functools.wraps(f)
        def new_f(*args, **kwds):
            for (df, specs) in zip(args, specification):
                spec_columns = [spec.label for spec in specs]
                assert (df.columns == spec_columns).all(), \
                    "Columns don't match specs {}".format(spec_columns)

                spec_dtypes = [spec.type for spec in specs]
                assert (df.dtypes == spec_dtypes).all(), \
                    "Dtypes don't match specs {}".format(spec_dtypes)
            return f(*args, **kwds)
        return new_f
    return check_accepts

I don't mind the complexity of the checking function, but it adds a lot of boilerplate code.

@dataframe_check([Col('a', int), Col('b', int)],    #  df1
                 [Col('a', int), Col('b', float)],) #  df2
def f(df1, df2):
    return df1 + df2

f(df, df)

Is there a more Pythonic way of type-checking DataFrames? Something that looks more like the new Python 3.6 static type-checking?

Is it possible to implement it in mypy?
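For comparison, one direction would be to move the spec into Python 3 function annotations and read it back at call time. This is only a sketch (the decorator and names are illustrative, and mypy would not understand these list-valued annotations, so the check stays purely a runtime one):

```python
import functools
from collections import namedtuple

import pandas as pd

Col = namedtuple('Col', 'label, type')

def check_annotations(f):
    """Validate each positional DataFrame against the Col list annotated
    on the corresponding parameter (runtime check only, not static)."""
    # Collect annotations in parameter order
    specs = [f.__annotations__[name]
             for name in f.__code__.co_varnames[:f.__code__.co_argcount]]
    @functools.wraps(f)
    def wrapper(*args, **kwds):
        for df, spec in zip(args, specs):
            assert (df.columns == [c.label for c in spec]).all(), \
                "Columns don't match spec {}".format(spec)
            assert (df.dtypes == [c.type for c in spec]).all(), \
                "Dtypes don't match spec {}".format(spec)
        return f(*args, **kwds)
    return wrapper

@check_annotations
def f(df1: [Col('a', int), Col('b', int)],
      df2: [Col('a', int), Col('b', float)]):
    return df1 + df2
```

The spec now sits next to the parameter it describes, which removes the duplication between the decorator call and the signature, but these annotations are plain lists rather than types, so static checkers ignore them.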

Upvotes: 8

Views: 6260

Answers (2)

artoby

Reputation: 1940

Try pandera. It's powerful and easy to adopt.

Example:

from pandera import Field, SchemaModel, check_types
from pandera.typing import DataFrame, Index, Series, Float64


class RawPriceSchema(SchemaModel):
    index: Index[int] = Field(unique=True)
    symbol: Series[str]
    price: Series[Float64] = Field(nullable=True)


RawPrice = DataFrame[RawPriceSchema]


# ...

@check_types
def foo(price: RawPrice):
    ...

There is a more detailed example in this demo repo and in this video.

Upvotes: 0

Dan

Reputation: 671

Perhaps not the most Pythonic way, but using a dict for your specs might do the trick (with keys as column names and values as data types):

import pandas as pd

df = pd.DataFrame(columns=['col1', 'col2'])
df['col1'] = df['col1'].astype('int')
df['col2'] = df['col2'].astype('str')

cols_dtypes_req = {'col1': 'int', 'col2': 'object'}  # 'str' dtype is 'object' in pandas

def check_df(dataframe, specs):
    for colname in specs:
        if colname not in dataframe:
            return 'Column missing.'
        elif dataframe[colname].dtype != specs[colname]:
            return 'Data type incorrect.'
    for dfcol in dataframe:
        if dfcol not in specs:
            return 'Unexpected dataframe column.'
    return 'Dataframe meets specifications.'

print(check_df(df, cols_dtypes_req))
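The same dict-based spec can also be condensed into a boolean predicate, which composes more easily with assert. A sketch (`matches` is a hypothetical helper, not part of pandas):

```python
import pandas as pd

def matches(dataframe, specs):
    """True iff the frame has exactly the spec'd columns with the spec'd dtypes."""
    return set(dataframe.columns) == set(specs) and all(
        dataframe[col].dtype == dtype for col, dtype in specs.items()
    )

specs = {'col1': 'int', 'col2': 'object'}
df_ok = pd.DataFrame({'col1': [1], 'col2': ['x']})
df_bad = pd.DataFrame({'col1': [1], 'col2': [2]})  # col2 is int, not object
```

Returning a bool loses the specific error message, but lets the caller decide whether to raise, log, or branch.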

Upvotes: 2
