TensorFlow Data Validation does not identify anomalies in numerical features

I've been testing TensorFlow Data Validation (version 0.22.0) for use in my current ML pipelines, and I noticed that it does not flag any anomalies in numerical features. For instance:

> import pandas as pd  
> import pyarrow 
> import tensorflow as tf 
> import apache_beam as beam 
> import apache_beam.io.iobase 
> import tensorflow_data_validation as tfdv 
> print('TFDV version: {}'.format(tfdv.version.__version__))
> 
> train_df = pd.DataFrame({
>     'FeatA' : ['A'] * 1000,
>     'FeatB' : ['B'] * 1000,
>     'FeatC' : [10] * 1000,
>     'FeatD' : [50.2] * 1000 })
> 
> eval_df = pd.DataFrame({
>     'FeatA' : ['A1'] * 1000,
>     'FeatB' : ['B1'] * 1000,
>     'FeatC' : [4] * 1000,
>     'FeatD' : [200.43] * 1000 })
> 
> train_stats  = tfdv.generate_statistics_from_dataframe(train_df)
> schema = tfdv.infer_schema(statistics = train_stats) 
> eval_stats = tfdv.generate_statistics_from_dataframe(eval_df) 
> anomalies = tfdv.validate_statistics(statistics = eval_stats, schema = schema)
> tfdv.display_anomalies(anomalies)

Anomalies were detected only in FeatA and FeatB, which are categorical. For FeatC and FeatD, TFDV does not report anything, even though their values differ completely between the two datasets.

The result is shown in this image

I've also tried setting skew and drift comparators, but nothing changed. I suspect it has to do with the auto-generated schema, which has no domain for the numerical features.

Does anyone have any idea how to get TFDV working for numerical features?

Upvotes: 1

Views: 1145

Answers (3)

dp6000

Reputation: 673

As explained by @durga, TFDV has added a new feature that allows us to detect skew for numeric features. Specify a jensen_shannon_divergence threshold instead of an infinity_norm threshold in the skew_comparator.

Example:

tfdv.get_feature(schema, 'total_actions').skew_comparator.jensen_shannon_divergence.threshold = 0.01

If you want to check the min and max value range, you need to manually set an inline FloatDomain/IntDomain on the feature; infer_schema() does not generate one automatically.

Example:

tfdv.get_feature(schema, 'total_actions').int_domain.name = 'total_actions'
tfdv.get_feature(schema, 'total_actions').int_domain.min = 0
tfdv.get_feature(schema, 'total_actions').int_domain.max = 1400

Upvotes: 0

durga

Reputation: 11

We need to use the jensen_shannon_divergence skew comparator for numerical features and infinity_norm for categorical features:

tfdv.get_feature(schema_updated, 'SALES').skew_comparator.jensen_shannon_divergence.threshold = 0.001

skew_anomalies = tfdv.validate_statistics(statistics=new_dataset_stats, schema=schema_updated, serving_statistics=old_dataset_stats)
tfdv.display_anomalies(skew_anomalies)
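
To give a feel for what the two thresholds are compared against, here is a plain-Python sketch of both distances computed on two histograms. It is not TFDV's implementation: the bin layout, the base-2 logarithm (which bounds the divergence to [0, 1]) and the example counts are all assumptions made for illustration, and TFDV's own binning and normalization may differ.

```python
import math

def normalize(counts):
    """Turn raw bin counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def l_infinity(p_counts, q_counts):
    """Largest absolute difference in probability for any single bin/value."""
    p, q = normalize(p_counts), normalize(q_counts)
    return max(abs(a - b) for a, b in zip(p, q))

def jensen_shannon_divergence(p_counts, q_counts):
    """JSD with base-2 logs: 0 for identical distributions, 1 for disjoint ones."""
    p, q = normalize(p_counts), normalize(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Bins with zero mass in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical histograms over the same three bins.
train_hist = [1000, 0, 0]   # all training values land in the first bin
eval_hist = [0, 0, 1000]    # all eval values land in the last bin

same = jensen_shannon_divergence(train_hist, train_hist)
shifted = jensen_shannon_divergence(train_hist, eval_hist)
linf = l_infinity(train_hist, eval_hist)
print(same, shifted, linf)
```

A threshold such as 0.001 is then simply the largest distance you are willing to tolerate before the feature is reported.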

Upvotes: 0

Amine_h

Reputation: 129

By default, TFDV does not infer domains for numerical features. You have three possible solutions:

1- Change the type of the dataframe column to str, so that it is treated as a BYTES feature.

2- Add an int_domain (float_domain for FeatD) to your features and set the desired min and max.

3- For int features only, you can set int_domain.is_categorical to True and then use a drift/skew comparator. You will then be able to detect new values among the top-k values.

Upvotes: 0
