hotmeatballsoup

Reputation: 625

Validating Pandas DataFrame columns and cell values

I have the following table cmf_data (here's the DDL I used to create it):

CREATE TABLE mydb.cmf_data (
    cmf_data_id int IDENTITY(1,1) NOT NULL,
    cmf_data_field_name varchar(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    cmf_data_field_data_type varchar(MAX) COLLATE SQL_Latin1_General_CP1_CI_AS NOT NULL,
    CONSTRAINT PK__client_m__BC8D4A1B625F6F4F PRIMARY KEY (cmf_data_id)
);

As you can see, this table stores field names and expected types. Here's an example output from running SELECT * FROM cmf_data;:

cmf_data_id|cmf_data_field_name|cmf_data_field_data_type|
-----------+-------------------+------------------------+
          1|Foo                |float                   |
          2|Baz                |float                   |
          3|Fizz               |datetime                |
          4|Buzz               |string                  |

Here is my Excel spreadsheet, in CSV format:

Foo,Baz,Fizz,Buzz
23.5,44.18,'2022-06-18','Hello there'
24.621,7.3,'2023-01-07','How is it going'
149.0,712.0,'2022-02-09','What a nice day'
11.74,101.03,'2021-10-26','Thank you so much'

The idea is that cmf_data dictates the expectations my code has about the column types found in the Excel spreadsheet. So if, say, the spreadsheet's Baz column contains a value that Pandas decides is a string, then, since the cmf_data table defines Baz as a float, that value should fail validation and raise an appropriate validation-related (or custom) error.

Here's what I am trying to do:

  1. Use Pandas to read the cmf_data table into a DataFrame (say, cmf_data_df)
  2. Use Pandas to read the spreadsheet into another DataFrame (say, excel_df)
  3. If the columns in excel_df do not match the rows (based on cmf_data_field_name) defined in cmf_data_df, raise a validation exception (hence the Excel can't contain any columns not defined in the cmf_data table, and it can't omit any columns either; the match must be exact)
  4. Enforce the column types, defined in the cmf_data_df, upon the rows and cells of the excel_df; if any particular row contains a value that cannot be cast/converted or conform to the type expectations on its column, raise a validation exception

Hence, if my Excel/CSV looks like it does above, it should validate just fine; the Excel's columns match the expectations of the CMF Data table exactly, and all the rows' cell values match the expectations of the CMF Data table as well.

But if the Excel/CSV looks like:

Foo,Baz,Buzz
23.5,44.18,'Hello there'
24.621,7.3,'How is it going'
149.0,712.0,'What a nice day'
11.74,101.03,'Thank you so much'

We should get a validation exception because the column names do not match what is expected. Similarly, if the Excel/CSV looks like:

Foo,Baz,Fizz,Buzz
23.5,44.18,'2022-06-18','Hello there'
24.621,7.3,'2023-01-07','How is it going'
149.0,'I should fail the validation','2022-02-09','What a nice day'
11.74,101.03,'2021-10-26','Thank you so much'

We should also get a validation exception, because on row 3, the Baz value is a string, which violates its definition in the CMF Data table (expecting it to be a float).

My best attempt thus far is:

import io
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

# read cmf_data_df from DB
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};"
                                 f"Server={host};"
                                 f"Database={database};"
                                 f"uid={uid};pwd={pwd}")
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}", fast_executemany=True)

query = "SELECT * FROM mydb.cmf_data"
result = engine.execute(query)
data = result.fetchall()
cols = list(result.keys())
cmf_data_df = pd.DataFrame(columns=cols, data=data)

# read the Excel spreadsheet -- I'm actually pulling the file from S3,
# which is what the obj['Body'].read() is for
excel_df = pd.read_excel(io.BytesIO(obj['Body'].read()), engine="openpyxl")

# how to validate that the columns in 'excel_df' match rows definitions in 'cmf_data_df' ?

# how to perform row-level validation on all the types in 'excel_df'?

Can anyone point me in the right direction to piece the whole thing together correctly?

Upvotes: 0

Views: 7790

Answers (3)

Ziur Olpa

Reputation: 2133

This may not be a comprehensive solution, but I would use pandera. Here is a tiny demonstration:

import numpy as np
import pandas as pd

from pandera import Column, DataFrameSchema

# here you would load your data / excel_df
# df = pd.read_excel(io.BytesIO(obj['Body'].read()), engine="openpyxl")
df = pd.DataFrame({"column1": [5, 1, np.nan]})

# here, a dictionary mapping column names to their types
# (for you, something like cmf_data_df.to_dict())
my_types = {'column1':'int'}
my_schema = DataFrameSchema({i:Column(j) for i,j in my_types.items()})

# finally, you validate; with this sample data the call raises a pandera
# SchemaError, since column1 ends up as float64 (because of the NaN), not int
my_schema.validate(df)

The error raised will tell you, row by row, what failed validation, but you can change that behavior: for example, passing strict='filter' to DataFrameSchema silently drops columns that are not declared in the schema instead of raising, among other possible behaviors.

There are many other functionalities and integrations of pandera that may help in your use case, but the basic idea is to load your schema into a pandera DataFrameSchema and let pandera do the validation for you.
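
To connect this to the question's setup, here is a minimal sketch (not from the original answer) of building the schema straight from cmf_data_df. The dtype_map translating the type names stored in cmf_data into pandas dtypes is an assumption, as is coerce=True, which makes pandera attempt the conversion instead of requiring the dtypes to already match:

from pandera import Column, DataFrameSchema

# assumed mapping from the type names stored in cmf_data to pandas dtypes
dtype_map = {'float': float, 'datetime': 'datetime64[ns]', 'string': str}

# build {field_name: expected_type} from the cmf_data_df read earlier
my_types = dict(zip(cmf_data_df['cmf_data_field_name'],
                    cmf_data_df['cmf_data_field_data_type']))

my_schema = DataFrameSchema(
    {name: Column(dtype_map.get(t, t)) for name, t in my_types.items()},
    strict=True,   # reject any excel_df column that is not declared in cmf_data
    coerce=True,   # attempt the conversion; unconvertible values raise
)

# raises pandera.errors.SchemaError on any column or value mismatch
validated_df = my_schema.validate(excel_df)

If the column order must match as well, DataFrameSchema also accepts ordered=True.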

Upvotes: 1

user21283023

Reputation: 976

Here is my solution to the two validation steps:


Sample Data

A DataFrame representing cmf_data:

import pandas as pd
data = {
    'cmf_data_id': [1, 2, 3, 4],
    'cmf_data_field_name': ['Foo', 'Baz', 'Fizz', 'Buzz'],
    'cmf_data_field_data_type': ['float', 'float', 'datetime', 'string']
}

cmf_data_df = pd.DataFrame(data)
print(cmf_data_df)
   cmf_data_id cmf_data_field_name cmf_data_field_data_type
0            1                 Foo                    float
1            2                 Baz                    float
2            3                Fizz                 datetime
3            4                Buzz                   string

A DataFrame representing the Excel spreadsheet data:

data = {
    'Foo': [23.5, 24.621, 149.0, 11.74],
    'Baz': [44.18, 7.3, 712.0, 101.03],
    'Fizz': ['2022-06-18', '2023-01-07', '2022-02-09', '2021-10-26'],
    'Buzz': ['Hello there', 'How is it going', 'What a nice day', 'Thank you so much']
}
excel_df = pd.DataFrame(data)
print(excel_df)
       Foo     Baz        Fizz               Buzz
0   23.500   44.18  2022-06-18        Hello there
1   24.621    7.30  2023-01-07    How is it going
2  149.000  712.00  2022-02-09    What a nice day
3   11.740  101.03  2021-10-26  Thank you so much

1) Validating columns

The columns can be validated according to the following conditions:

  1. The number of columns in the excel spreadsheet is the same as the number of field names stored in cmf_data.

  2. Assuming that condition 1 is true, the columns match pairwise with the field names (the order of columns matters here).

To implement the first condition, the len() function can be used to get the number of columns/field-names.

To implement the second condition, the .all() method can be used after comparing pairwise, to ensure all the matches were successful.

# Get the field names stored in cmf_data
field_names = cmf_data_df['cmf_data_field_name']
# Get the columns of the excel spreadsheet.
excel_cols  = excel_df.columns

# Are they equal in number, AND (if so) do they actually match pairwise?
column_validation = len(field_names) == len(excel_cols) and (field_names == excel_cols).all()

# Raise an error if this validation fails.
if not column_validation:
    raise Exception("Columns do not match the fields.")

You could split these up as follows if you want more specific validation exceptions (e.g. "Wrong number of columns" or "At least one column does not match a field").

field_names = cmf_data_df['cmf_data_field_name']
excel_cols  = excel_df.columns

if len(field_names) != len(excel_cols):
    raise Exception(f"Expected {len(field_names)} columns, but got {len(excel_cols)}.")

if not (field_names == excel_cols).all():
    raise Exception(f"Columns do not all match the fields.")

2) Validating data types

The data types can be checked by iterating over each column and converting it to the corresponding type in cmf_data_df, using the .astype() method.

This will fail for 'datetime', which is not recognised as a type. Either store it as 'datetime64[ns]', or use the pd.to_datetime() function instead to convert whenever it comes up. I will use the latter approach.

# Create a copy of the excel_df DataFrame that will hold the converted columns.
new_df = excel_df.copy()
# Iterate over the index positions of each row in cmf_data_df
for i in range(len(cmf_data_df)):
    
    # Get the data type corresponding to the field.
    type_i = cmf_data_df.loc[i, 'cmf_data_field_data_type']
    
    # Get the column at the equivalent (integer-based) position i.
    col_i = new_df.iloc[:, i]
    
    # Prepare to catch any possible errors:
    try:
        # If 'datetime' is stored, use the specialised pd.to_datetime() function to convert it.
        if type_i == 'datetime':
            new_df.iloc[:, i] = pd.to_datetime(col_i)
            
        # Otherwise convert the column to the given type with the .astype() method.
        else:
            new_df.iloc[:, i] = col_i.astype(type_i)
    
    # Catch conversion errors and re-raise them as a validation exception.
    except (ValueError, TypeError) as e:
        raise Exception(f"Could not convert the column '{new_df.columns[i]}' to type '{type_i}'") from e

Output of new_df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Foo     4 non-null      float64       
 1   Baz     4 non-null      float64       
 2   Fizz    4 non-null      datetime64[ns]
 3   Buzz    4 non-null      string        
dtypes: datetime64[ns](1), float64(2), string(1)
memory usage: 256.0 bytes

Full code (as generalised functions)

# Checks that a given range of columns match a range of field names. (By default raise error if the matching fails)
def validate_columns(cols, field_names, raise_error=True):
    
    column_validation = len(field_names) == len(cols) and (field_names == cols).all()
    
    if raise_error and not column_validation:
        raise Exception("Columns do not match the fields.")
        
    return column_validation

        
# Attempts to create a copy of a DataFrame with columns converted to a given set of data types. 
def enforce_dtypes(df, dtypes):
    new_df = df.copy()
    
    # enumerate iterates positionally, so it works for a list or a pandas Series
    for i, type_i in enumerate(dtypes):
    
        col_i = new_df.iloc[:, i]
    
        try:
            if type_i == 'datetime':
                new_df.iloc[:, i] = pd.to_datetime(col_i)
            else:
                new_df.iloc[:, i] = col_i.astype(type_i)
    
        except (ValueError, TypeError) as e:
            raise Exception(f"Could not convert column '{new_df.columns[i]}' to type '{type_i}'") from e
    
    return new_df

Usage:

# Check whether the excel spreadsheet column names match up exactly with the field names stored in cmf_data.
validate_columns(cols=excel_df.columns, field_names=cmf_data_df['cmf_data_field_name'])

# Attempt to convert the columns of the excel spreadsheet to the data types for each field stored in cmf_data
new_df = enforce_dtypes(df=excel_df, dtypes=cmf_data_df['cmf_data_field_data_type'])

Examples

Unmatchable columns

Excel data:

data = {
    'Foo': [23.5, 24.621, 149.0, 11.74],
    'Baz': [44.18, 7.3, 712.0, 101.03],
    'Buzz': ['Hello there', 'How is it going', 'What a nice day', 'Thank you so much']
}
excel_df = pd.DataFrame(data)
print(excel_df)
       Foo     Baz               Buzz
0   23.500   44.18        Hello there
1   24.621    7.30    How is it going
2  149.000  712.00    What a nice day
3   11.740  101.03  Thank you so much

Validation:

validate_columns(excel_df.columns, cmf_data_df['cmf_data_field_name'])
new_df = enforce_dtypes(excel_df, cmf_data_df['cmf_data_field_data_type'])

Error:

Exception: Columns do not match the fields.

Unconvertible columns

Excel data:

data = {
    'Foo': [23.5, 24.621, 149.0, 11.74],
    'Baz': [44.18, 7.3, 'I should fail the validation', 101.03],
    'Fizz': ['2022-06-18', '2023-01-07', '2022-02-09', '2021-10-26'],
    'Buzz': ['Hello there', 'How is it going', 'What a nice day', 'Thank you so much']
}
excel_df = pd.DataFrame(data)
print(excel_df)
       Foo                           Baz        Fizz               Buzz
0   23.500                         44.18  2022-06-18        Hello there
1   24.621                           7.3  2023-01-07    How is it going
2  149.000  I should fail the validation  2022-02-09    What a nice day
3   11.740                        101.03  2021-10-26  Thank you so much

Validation:

validate_columns(cols=excel_df.columns, field_names=cmf_data_df['cmf_data_field_name'])
new_df = enforce_dtypes(df=excel_df, dtypes=cmf_data_df['cmf_data_field_data_type'])

Error:

Exception: Could not convert column 'Baz' to type 'float'

Upvotes: 2

cottontail

Reputation: 23341

Pandas has a dedicated method, astype, that can change the dtypes of columns using a dict-like mapping, so you can use cmf_data_df to convert the dtypes in excel_df. One small piece of cleaning is needed first, however: the pandas datetime dtype is named datetime64[ns], so in cmf_data_df, replace 'datetime' with 'datetime64[ns]'.

The conversion succeeds only if every mapped column is present in excel_df and every value can be converted according to the mapping. (Note that astype will not complain about extra columns in excel_df that are absent from the mapping; an exact-match check on the column names still needs to be done separately.)

# dtype converter mapping
mapping = cmf_data_df.set_index('cmf_data_field_name')['cmf_data_field_data_type'].str.replace('datetime', 'datetime64[ns]', regex=False)
# convert dtypes (astype also accepts a Series as a column-to-dtype mapping)
excel_df = excel_df.astype(mapping)

This code performs the conversion without error for the first case, and for the failing cases it raises errors that pinpoint exactly where the problem is. For example, for the second case it throws a KeyError: "Only a column name can be used for the key in a dtype mappings argument. 'Fizz' not found in columns." and for the third case it raises a ValueError: could not convert string to float: 'I should fail the validation'.
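
Putting the two checks together, here is a minimal sketch (assuming the cmf_data_df and excel_df names from the question) that first rejects unexpected columns, since astype alone will not, and then applies the mapping:

# build the column -> dtype mapping, translating 'datetime' into a pandas dtype name
mapping = (cmf_data_df.set_index('cmf_data_field_name')['cmf_data_field_data_type']
           .str.replace('datetime', 'datetime64[ns]', regex=False))

# astype only verifies that every mapped column exists; extra spreadsheet
# columns would pass silently, so check for them explicitly
extra = set(excel_df.columns) - set(mapping.index)
if extra:
    raise ValueError(f"Unexpected columns: {sorted(extra)}")

# raises KeyError for a missing column, ValueError for an unconvertible value
excel_df = excel_df.astype(mapping)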

Upvotes: 1
