Terry Snow

Reputation: 45

Pandas dataframe schema validation for combination of columns

I am developing Pandas DataFrame Schema validation code (in python) using pandera and am looking for the best approach to verify a DataFrame has unique values for a combination of columns.

The original data is supplied by others in CSV format. My code loads the CSV into a pandas DataFrame and then validates it with a pandera DataFrameSchema. The DataFrame has columns for a geographic coordinate system using X and Y coordinates, and the nature of the data is that each row in the data set should have a unique X,Y coordinate.

The csv file has the general form:
x_coord, y_coord, value_A, value_B
12.1234, 23.2345, 27.23, 32.84
34.3456, 45.4567, 21.12, 22.32
....
....
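For reference, the uniqueness requirement itself is easy to state in plain pandas ('data.csv' below is just a placeholder filename):

import pandas as pd

df = pd.read_csv('data.csv')
# flag rows whose (x_coord, y_coord) pair already appeared earlier
dupes = df.duplicated(subset=['x_coord', 'y_coord'])
assert not dupes.any(), 'duplicate X,Y coordinates found'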

Using pandera, the only way I can think of doing this is a multi-step approach:

  1. Load the csv file into a pandas DataFrame.
  2. Create a single-column pandas DataFrame where the column name is (say) 'coords' and the values are string combinations of the coordinate columns from the csv DataFrame.
  3. Validate the coords DataFrame with a pandera DataFrameSchema whose single column checks uniqueness via allow_duplicates=False.
  4. Validate the csv DataFrame with its own pandera schema.
  5. Combine the schema errors from the two schema validations and raise that as an error (see the sketch after the code below).

The approach seems a little clunky, and I am looking for other options that take better advantage of pandera's flexibility.

Code to implement the above is:

import pandas as pd
import pandera as pa

def get_valid_coord_df(filename):
    df = pd.read_csv(filename)

    # schema for the raw CSV columns
    schema = pa.DataFrameSchema(columns={
        'x_coord': pa.Column(pa.Float,
                             pa.Check.in_range(0, 100_000)),
        'y_coord': pa.Column(pa.Float,
                             pa.Check.in_range(0, 100_000)),
        'value_A': pa.Column(pa.Float,
                             pa.Check.in_range(0, 20)),
        'value_B': pa.Column(pa.Float,
                             pa.Check.in_range(0, 20))
        },
        strict=True, ordered=True)

    schema.validate(df, lazy=True)  # raises SchemaErrors on failure

    # single-column DataFrame holding the combined "x,y" strings
    df_coord = pd.DataFrame(
        {'coords': df['x_coord'].astype(str) + ',' + df['y_coord'].astype(str)})

    schema_coord = pa.DataFrameSchema(columns={
        'coords': pa.Column(pa.String,
                            allow_duplicates=False)})

    schema_coord.validate(df_coord, lazy=True)  # raises SchemaErrors on failure

    return df
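As written, the first failing validate raises immediately, so step 5 never actually happens. A minimal sketch of what I mean by combining the errors (pandera's SchemaErrors exposes a failure_cases frame; combine_failures is just a name I made up):

import pandas as pd
from pandera.errors import SchemaErrors

def combine_failures(df, df_coord, schema, schema_coord):
    # run both validations, collecting failure cases rather than
    # stopping at the first SchemaErrors
    failures = []
    for data, sch in ((df, schema), (df_coord, schema_coord)):
        try:
            sch.validate(data, lazy=True)
        except SchemaErrors as err:
            failures.append(err.failure_cases)
    if failures:
        raise ValueError(
            'schema validation failed:\n' + pd.concat(failures).to_string())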


Upvotes: 4

Views: 5049

Answers (1)

cosmicBboy

Reputation: 169

Here you can use wide checks, which give the Check function access to the entire dataframe:

import pandera as pa

schema = pa.DataFrameSchema(
    columns={
        "x_coord": pa.Column(pa.Float, pa.Check.in_range(0, 100_000)),
        "y_coord": pa.Column(pa.Float, pa.Check.in_range(0, 100_000)),
        "value_A": pa.Column(pa.Float, pa.Check.in_range(0, 20)),
        "value_B": pa.Column(pa.Float, pa.Check.in_range(0, 20)),
    },
    # checks at the DataFrameSchema level take the dataframe as input
    # (as opposed to the series at the Column level)
    checks=pa.Check(lambda df: ~df[["x_coord", "y_coord"]].duplicated()),
    strict=True,
    ordered=True,
)
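For example (with made-up values), a frame containing a repeated coordinate pair now fails validation:

import pandas as pd

df = pd.DataFrame({
    "x_coord": [1.0, 1.0, 2.0],
    "y_coord": [4.0, 4.0, 5.0],  # rows 0 and 1 share the same X,Y pair
    "value_A": [1.0, 2.0, 3.0],
    "value_B": [1.0, 2.0, 3.0],
})
schema.validate(df)  # raises SchemaError on the duplicated coordinates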

If you find yourself using this check often in your codebase, you can also define a check extension so that you have access to it in the pa.Check namespace:

import pandas as pd
import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(
    statistics=["columns"], supported_types=pd.DataFrame,
)
def unique_columns(pandas_obj, *, columns):
    return ~pandas_obj[columns].duplicated()


schema = pa.DataFrameSchema(
    checks=pa.Check.unique_columns(["x_coord", "y_coord"])
)

schema.validate(
    pd.DataFrame(
        {
            "x_coord": [1.0, 2.0, 3.0],
            "y_coord": [4.0, 5.0, 6.0],
        }
    )
)
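Conversely (again with made-up values), a frame containing a repeated pair raises a SchemaError:

schema.validate(
    pd.DataFrame(
        {
            "x_coord": [1.0, 1.0],
            "y_coord": [4.0, 4.0],
        }
    )
)  # raises SchemaError because the unique_columns check fails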

Upvotes: 6
