Reputation: 45
I am developing Pandas DataFrame Schema validation code (in python) using pandera and am looking for the best approach to verify a DataFrame has unique values for a combination of columns.
The original data is supplied by others in CSV format. My code loads the CSV into a Pandas DataFrame and then runs a pandera DataFrameSchema validation. The DataFrame has columns for a geographic coordinate system using X and Y coordinates, and the nature of the data is that each row in the data set should have a unique X,Y coordinate.
The csv file has the general form:
x_coord, y_coord, value_A, value_B
12.1234, 23.2345, 27.23, 32.84
34.3456, 45.4567, 21.12, 22.32
....
....
Using pandera, the only way that I can think of doing this is to take a multi-step approach:
1. Validate the individual columns (types and value ranges).
2. Build a second DataFrame with a single string column that concatenates each row's X and Y coordinates.
3. Validate that column with allow_duplicates=False to catch repeated coordinate pairs.
The approach seems a little clunky, and I am looking for other options that take more advantage of the flexibility in pandera.
Code to implement the above is:
import pandas as pd
import pandera as pa

def get_valid_coord_df(filename):
    df = pd.read_csv(filename)
    # Step 1: validate types and value ranges column by column.
    schema = pa.DataFrameSchema(
        columns={
            'x_coord': pa.Column(pa.Float, pa.Check.in_range(0, 100_000)),
            'y_coord': pa.Column(pa.Float, pa.Check.in_range(0, 100_000)),
            'value_A': pa.Column(pa.Float, pa.Check.in_range(0, 20)),
            'value_B': pa.Column(pa.Float, pa.Check.in_range(0, 20)),
        },
        strict=True, ordered=True)
    schema.validate(df, lazy=True)  # raises SchemaErrors when validation fails
    # Step 2: combine both coordinates into one string column.
    df_coord = (df['x_coord'].astype(str) + ',' + df['y_coord'].astype(str)).to_frame('xy_coord')
    # Step 3: validate that the combined coordinate strings are unique.
    schema_coord = pa.DataFrameSchema(columns={
        'xy_coord': pa.Column(pa.String, allow_duplicates=False)})
    schema_coord.validate(df_coord, lazy=True)  # raises SchemaErrors when validation fails
    return df
Upvotes: 4
Views: 5049
Reputation: 169
Here you can use wide checks, which give the Check function access to the entire dataframe (as opposed to the single series a Column-level check receives):
import pandera as pa
schema = pa.DataFrameSchema(
    columns={
        "x_coord": pa.Column(pa.Float, pa.Check.in_range(0, 100_000)),
        "y_coord": pa.Column(pa.Float, pa.Check.in_range(0, 100_000)),
        "value_A": pa.Column(pa.Float, pa.Check.in_range(0, 20)),
        "value_B": pa.Column(pa.Float, pa.Check.in_range(0, 20)),
    },
    # checks at the DataFrameSchema level take the dataframe as input
    # (as opposed to the series at the Column level)
    checks=pa.Check(lambda df: ~df[["x_coord", "y_coord"]].duplicated()),
    strict=True,
    ordered=True,
)
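The core of this wide check is plain pandas: DataFrame.duplicated with a column subset marks repeated (x, y) pairs, and negating it yields the boolean Series the Check expects, where True means the row passes. A pandas-only sketch of what the lambda above computes (sample values are made up for illustration):

```python
import pandas as pd

# Rows 0 and 2 share the same (x_coord, y_coord) pair.
df = pd.DataFrame({
    "x_coord": [12.1234, 34.3456, 12.1234],
    "y_coord": [23.2345, 45.4567, 23.2345],
})

# duplicated() marks every repeat after the first occurrence; ~ flips it so
# True means "this coordinate pair has not been seen before".
mask = ~df[["x_coord", "y_coord"]].duplicated()
print(mask.tolist())  # [True, True, False]
```

Any False entry in that Series is reported by pandera as a failing row.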
If you find yourself using this check more often in your codebase, you can also define a check extension so that you have access to it in the pa.Check
namespace:
import pandas as pd
import pandera as pa
import pandera.extensions as extensions

@extensions.register_check_method(
    statistics=["columns"], supported_types=pd.DataFrame,
)
def unique_columns(pandas_obj, *, columns):
    return ~pandas_obj[columns].duplicated()

schema = pa.DataFrameSchema(
    checks=pa.Check.unique_columns(["x_coord", "y_coord"])
)
schema.validate(
    pd.DataFrame(
        {
            "x_coord": [1.0, 2.0, 3.0],
            "y_coord": [4.0, 5.0, 6.0],
        }
    )
)
Upvotes: 6