Reputation: 625
I have the following table cmf_data (here's the DDL I used to create it):
CREATE TABLE mydb.cmf_data (
    cmf_data_id int IDENTITY(1,1) NOT NULL,
    cmf_data_field_name varchar(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    cmf_data_field_data_type varchar(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    CONSTRAINT PK__client_m__BC8D4A1B625F6F4F PRIMARY KEY (cmf_data_id)
);
As you can see, this table stores field names and expected types. Here's an example output from running SELECT * FROM cmf_data:
cmf_data_id|cmf_data_field_name|cmf_data_field_data_type|
-----------+-------------------+------------------------+
1|Foo |float |
2|Baz |float |
3|Fizz |datetime |
4|Buzz |string |
Here is my Excel spreadsheet, in CSV format:
Foo,Baz,Fizz,Buzz
23.5,44.18,'2022-06-18','Hello there',
24.621,7.3,'2023-01-07','How is it going',
149.0,712.0,'2022-02-09','What a nice day',
11.74,101.03,'2021-10-26','Thank you so much'
The idea is that the cmf_data table dictates the expectations my code has on the column types found in the Excel spreadsheet. So if, say, the spreadsheet's Baz column contains a value that Pandas decides is a string, then, since the cmf_data table defines Baz as a float, that should fail validation and raise an appropriate validation-related (or custom) error.
Here's what I am trying to do:
1. Read the cmf_data table into a DataFrame (say, cmf_data_df)
2. Read the Excel spreadsheet into a DataFrame (say, excel_df)
3. If the columns of excel_df do not match the rows (based on cmf_data_field_name) defined in cmf_data_df, raise a validation exception (hence the Excel can't contain any columns not defined in the cmf_data table; and it can't omit any columns either; they must be exact)
4. Enforce the type expectations defined in cmf_data_df upon the rows and cells of excel_df; if any particular row contains a value that cannot be cast/converted or conform to the type expectations on its column, raise a validation exception
Hence, if my Excel/CSV looks like it does above, it should validate just fine; the Excel's columns match the expectations of the CMF Data table exactly, and all the rows' cell values match the expectations of the CMF Data table as well.
But if the Excel/CSV looks like:
Foo,Baz,Buzz
23.5,44.18,'Hello there',
24.621,7.3,'How is it going',
149.0,712.0,'What a nice day',
11.74,101.03,'Thank you so much'
We should get a validation exception because the column names do not match what is expected. Similarly, if the Excel/CSV looks like:
Foo,Baz,Fizz,Buzz
23.5,44.18,'2022-06-18','Hello there',
24.621,7.3,'2023-01-07','How is it going',
149.0,'I should fail the validation','2022-02-09','What a nice day',
11.74,101.03,'2021-10-26','Thank you so much'
We should also get a validation exception, because on row 3 the Baz value is a string, which violates its definition in the CMF Data table (which expects a float).
My best attempt thus far is:
import io
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

# read cmf_data_df from DB
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};"
                                 f"Server={host};"
                                 f"Database={database};"
                                 f"uid={uid};pwd={pwd}")
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}", fast_executemany=True)
query = "SELECT * FROM mydb.cmf_data"
result = engine.execute(query)
data = result.fetchall()
cols = list(result.keys())
cmf_data_df = pd.DataFrame(columns=cols, data=data)
# read excel -- I'm actually pulling the file from S3,
# which is what the 'obj['Body'].read()' is for
excel_df = pd.read_excel(io.BytesIO(obj['Body'].read()), engine="openpyxl")
# how to validate that the columns in 'excel_df' match rows definitions in 'cmf_data_df' ?
# how to perform row-level validation on all the types in 'excel_df'?
Can anyone point me in the right direction to piece the whole thing together correctly?
Upvotes: 0
Views: 7790
Reputation: 2133
This may not be a comprehensive solution, but I would use pandera. Here's a tiny demonstration:
import numpy as np
import pandas as pd
import pandera as pa
from pandera import Check, Column, DataFrameSchema
# here you would write your data/excel_data_df
# df = pd.read_excel(io.BytesIO(obj['Body'].read()), engine="openpyxl")
df = pd.DataFrame({"column1": [5, 1, np.nan]})
# here a dictionary with col and the types
# for you something like cmf_data_df.to_dict()
my_types = {'column1': 'int'}
my_schema = DataFrameSchema({name: Column(dtype) for name, dtype in my_types.items()})
# finally you validate the result row-wise
my_schema.validate(df)
The error raised will point you, row-wise, to the validation failures. You can change that behavior with DataFrameSchema(..., strict='filter'), which, instead of raising an error, drops any columns that are not defined in the schema, among other possible behaviors.
There are many other pandera features and integrations that may help in your use case, but the basic idea is to load your schema into a pandera DataFrameSchema and let pandera do the validation for you.
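For the schema in the question, here is a minimal sketch of that idea (assuming cmf_data_df and excel_df are already loaded as in the question; the dtype_map translating the stored type names into dtypes pandera understands is my own assumption, not something pandera provides):
import pandas as pd
from pandera import Column, DataFrameSchema

# Assumed mapping from the type names stored in cmf_data to pandas dtypes.
dtype_map = {'float': float, 'datetime': 'datetime64[ns]', 'string': str}

# Build one schema Column per row of cmf_data_df.
my_types = dict(zip(cmf_data_df['cmf_data_field_name'],
                    cmf_data_df['cmf_data_field_data_type'].map(dtype_map)))
my_schema = DataFrameSchema(
    {name: Column(dtype) for name, dtype in my_types.items()},
    strict=True,  # reject columns not in the schema; missing columns already fail, as Columns are required by default
    coerce=True,  # try to convert each column; any value that cannot be converted fails validation
)

my_schema.validate(excel_df)  # raises a pandera SchemaError on failure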
Upvotes: 1
Reputation: 976
Here is my solution to the two validation steps:
A DataFrame representing cmf_data:
import pandas as pd
data = {
'cmf_data_id': [1, 2, 3, 4],
'cmf_data_field_name': ['Foo', 'Baz', 'Fizz', 'Buzz'],
'cmf_data_field_data_type': ['float', 'float', 'datetime', 'string']
}
cmf_data_df = pd.DataFrame(data)
print(cmf_data_df)
   cmf_data_id cmf_data_field_name cmf_data_field_data_type
0            1                 Foo                    float
1            2                 Baz                    float
2            3                Fizz                 datetime
3            4                Buzz                   string
A DataFrame representing the Excel spreadsheet data:
data = {
'Foo': [23.5, 24.621, 149.0, 11.74],
'Baz': [44.18, 7.3, 712.0, 101.03],
'Fizz': ['2022-06-18', '2023-01-07', '2022-02-09', '2021-10-26'],
'Buzz': ['Hello there', 'How is it going', 'What a nice day', 'Thank you so much']
}
excel_df = pd.DataFrame(data)
print(excel_df)
       Foo     Baz        Fizz               Buzz
0   23.500   44.18  2022-06-18        Hello there
1   24.621    7.30  2023-01-07    How is it going
2  149.000  712.00  2022-02-09    What a nice day
3   11.740  101.03  2021-10-26  Thank you so much
The columns can be validated according to the following conditions:
1. The number of columns in the Excel spreadsheet is the same as the number of field names stored in cmf_data.
2. Assuming that condition 1 is true, the columns match pairwise with the field names (the order of the columns matters here).
To implement the first condition, the len() function can be used to get the number of columns/field names. To implement the second condition, the .all() method can be used after comparing pairwise, to ensure all the matches were successful.
# Get the field names stored in cmf_data
field_names = cmf_data_df['cmf_data_field_name']
# Get the columns of the excel spreadsheet.
excel_cols = excel_df.columns
# Are they equal in number, AND (if so) do they actually match pairwise?
column_validation = len(field_names) == len(excel_cols) and (field_names == excel_cols).all()
# Raise an error if this validation fails.
if not column_validation:
    raise Exception("Columns do not match the fields.")
You could split these up as follows if you want more specific validation exceptions (e.g. "Wrong number of columns" vs. "At least one column does not match a field"):
field_names = cmf_data_df['cmf_data_field_name']
excel_cols = excel_df.columns
if len(field_names) != len(excel_cols):
    raise Exception(f"Expected {len(field_names)} columns, but got {len(excel_cols)}.")
if not (field_names == excel_cols).all():
    raise Exception("Columns do not all match the fields.")
The data types can be checked by iterating over each column and converting it to the corresponding type in cmf_data_df, using the .astype() method. This will fail for 'datetime', which is not recognised as a type. Either store it as 'datetime64[ns]', or use the pd.to_datetime() function instead to convert whenever it comes up. I will use the latter approach.
# Create a copy of the excel_df DataFrame that will hold the converted columns.
new_df = excel_df.copy()
# Iterate over the index positions of each row in cmf_data_df.
for i in range(len(cmf_data_df)):
    # Get the data type corresponding to the field.
    type_i = cmf_data_df.loc[i, 'cmf_data_field_data_type']
    # Get the column at the equivalent (integer-based) position i.
    col_i = new_df.iloc[:, i]
    # Prepare to catch any possible errors:
    try:
        # If 'datetime' is stored, use the specialised pd.to_datetime() function to convert it.
        if type_i == 'datetime':
            new_df.iloc[:, i] = pd.to_datetime(col_i)
        # Otherwise convert the column to the given type with the .astype() method.
        else:
            new_df.iloc[:, i] = col_i.astype(type_i)
    # Catch any error (assumed to be raised from a failure to convert the column).
    except Exception:
        raise Exception(f"Could not convert the column '{new_df.columns[i]}' to type '{type_i}'")
Output of new_df.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Foo     4 non-null      float64
 1   Baz     4 non-null      float64
 2   Fizz    4 non-null      datetime64[ns]
 3   Buzz    4 non-null      string
dtypes: datetime64[ns](1), float64(2), string(1)
memory usage: 256.0 bytes
These two steps can be wrapped up into reusable functions:
# Checks that a given range of columns matches a range of field names. (By default, raise an error if the matching fails.)
def validate_columns(cols, field_names, raise_error=True):
    column_validation = len(field_names) == len(cols) and (field_names == cols).all()
    if raise_error and not column_validation:
        raise Exception("Columns do not match the fields.")
    return column_validation

# Attempts to create a copy of a DataFrame with columns converted to a given set of data types.
def enforce_dtypes(df, dtypes):
    new_df = df.copy()
    for i in range(len(dtypes)):
        type_i = dtypes[i]
        col_i = new_df.iloc[:, i]
        try:
            if type_i == 'datetime':
                new_df.iloc[:, i] = pd.to_datetime(col_i)
            else:
                new_df.iloc[:, i] = col_i.astype(type_i)
        except Exception:
            raise Exception(f"Could not convert column '{new_df.columns[i]}' to type '{type_i}'")
    return new_df
Usage:
# Check whether the excel spreadsheet column names match up exactly with the field names stored in cmf_data.
validate_columns(cols=excel_df.columns, field_names=cmf_data_df['cmf_data_field_name'])
# Attempt to convert the columns of the excel spreadsheet to the data types for each field stored in cmf_data.
new_df = enforce_dtypes(df=excel_df, dtypes=cmf_data_df['cmf_data_field_data_type'])
Excel data:
data = {
'Foo': [23.5, 24.621, 149.0, 11.74],
'Baz': [44.18, 7.3, 712.0, 101.03],
'Buzz': ['Hello there', 'How is it going', 'What a nice day', 'Thank you so much']
}
excel_df = pd.DataFrame(data)
print(excel_df)
       Foo     Baz               Buzz
0   23.500   44.18        Hello there
1   24.621    7.30    How is it going
2  149.000  712.00    What a nice day
3   11.740  101.03  Thank you so much
Validation:
validate_columns(excel_df.columns, cmf_data_df['cmf_data_field_name'])
new_df = enforce_dtypes(excel_df, cmf_data_df['cmf_data_field_data_type'])
Error:
Exception: Columns do not match the fields.
Excel data:
data = {
'Foo': [23.5, 24.621, 149.0, 11.74],
'Baz': [44.18, 7.3, 'I should fail the validation', 101.03],
'Fizz': ['2022-06-18', '2023-01-07', '2022-02-09', '2021-10-26'],
'Buzz': ['Hello there', 'How is it going', 'What a nice day', 'Thank you so much']
}
excel_df = pd.DataFrame(data)
print(excel_df)
       Foo                           Baz        Fizz               Buzz
0   23.500                         44.18  2022-06-18        Hello there
1   24.621                           7.3  2023-01-07    How is it going
2  149.000  I should fail the validation  2022-02-09    What a nice day
3   11.740                        101.03  2021-10-26  Thank you so much
Validation:
validate_columns(cols=excel_df.columns, field_names=cmf_data_df['cmf_data_field_name'])
new_df = enforce_dtypes(df=excel_df, dtypes=cmf_data_df['cmf_data_field_data_type'])
Error:
Exception: Could not convert column 'Baz' to type 'float'
Upvotes: 2
Reputation: 23341
Pandas has a dedicated method, astype, that can change the dtypes of columns using a dict-like mapping. So you can use cmf_data_df to convert the dtypes in excel_df. One small cleanup is needed, however: Pandas' datetime dtype is named datetime64[ns], so on cmf_data_df, replace datetime with datetime64[ns].
The conversion only works if all the columns are there and every value in the frame can be converted according to the mapping.
# dtype converter mapping ('datetime' -> 'datetime64[ns]', the name pandas expects)
mapping = cmf_data_df.set_index('cmf_data_field_name')['cmf_data_field_data_type'].str.replace('datetime', 'datetime64[ns]')
# convert dtypes
excel_df = excel_df.astype(mapping)
This code performs the conversion without error for the first case, and raises very helpful errors that show exactly where the problem is. For the second case, it throws a KeyError: "Only a column name can be used for the key in a dtype mappings argument. 'Fizz' not found in columns.", and for the third case, it raises a ValueError: could not convert string to float: "'I should fail the validation'".
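If you also need the exact-match column check from the question (astype with a mapping tolerates extra columns in the frame, and it raises built-in KeyError/ValueError rather than a validation-specific error), a small sketch along these lines could tie it together. The ValidationError class and validate_excel function are hypothetical names, not pandas API:
import pandas as pd

class ValidationError(Exception):
    """Hypothetical custom exception for spreadsheet validation failures."""

def validate_excel(excel_df, cmf_data_df):
    # Build the dtype mapping as above.
    mapping = (cmf_data_df.set_index('cmf_data_field_name')['cmf_data_field_data_type']
               .str.replace('datetime', 'datetime64[ns]'))
    # astype() ignores extra columns in excel_df, so enforce an exact,
    # order-sensitive match up front.
    if list(excel_df.columns) != list(mapping.index):
        raise ValidationError(
            f"Expected columns {list(mapping.index)}, got {list(excel_df.columns)}")
    try:
        # Raises on any value that cannot be converted to its column's dtype.
        return excel_df.astype(mapping)
    except (KeyError, ValueError, TypeError) as e:
        raise ValidationError(f"Type validation failed: {e}") from e

new_df = validate_excel(excel_df, cmf_data_df)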
Upvotes: 1