Reputation: 21
I've been testing TensorFlow Data Validation (version 0.22.0) for use in my current ML pipelines, and I noticed it does not detect any anomalies in numerical features. For instance,
> import pandas as pd
> import pyarrow
> import tensorflow as tf
> import apache_beam as beam
> import apache_beam.io.iobase
> import tensorflow_data_validation as tfdv
> print('TFDV version: {}'.format(tfdv.version.__version__))
>
> train_df = pd.DataFrame({
> 'FeatA' : ['A'] * 1000,
> 'FeatB' : ['B'] * 1000,
> 'FeatC' : [10] * 1000,
> 'FeatD' : [50.2] * 1000 })
>
> eval_df = pd.DataFrame({
> 'FeatA' : ['A1'] * 1000,
> 'FeatB' : ['B1'] * 1000,
> 'FeatC' : [4] * 1000,
> 'FeatD' : [200.43] * 1000 })
>
> train_stats = tfdv.generate_statistics_from_dataframe(train_df)
> schema = tfdv.infer_schema(statistics = train_stats)
> eval_stats = tfdv.generate_statistics_from_dataframe(eval_df)
> anomalies = tfdv.validate_statistics(statistics = eval_stats, schema = schema)
> tfdv.display_anomalies(anomalies)
Anomalies were detected only in FeatA and FeatB, which are categorical. In FeatC and FeatD, TFDV does not detect anything.
The result is shown in this image
I've also tried setting skew and drift comparators, but nothing changed. I guess it has to do with the auto-generated schema, which has no domain mapped for the numerical features.
Does anyone have any idea how to get TFDV working for numerical features?
Upvotes: 1
Views: 1145
Reputation: 673
As explained by @durga, TFDV has added a new feature that allows us to detect skew for numeric features. Specify a jensen_shannon_divergence threshold instead of an infinity_norm threshold in the skew_comparator.
Example:
tfdv.get_feature(schema, 'total_actions').skew_comparator.jensen_shannon_divergence.threshold = 0.01
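To get a feel for what such a threshold is compared against, here is a minimal pure-Python sketch of the Jensen-Shannon divergence between two histograms, using the FeatD values from the question. It is only an illustration, not TFDV's implementation: the bucket edges, the base-2 logarithm, and the helper names (histogram, kl_divergence, jensen_shannon_divergence) are all assumptions made for this example.

```python
import math


def histogram(values, edges):
    """Fraction of values in each bucket [edges[i], edges[i+1]); the last bucket also includes its upper edge."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)  # assumes at least one value falls inside the edges
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions; ranges from 0 to 1 with log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical bucket edges, chosen only for this illustration.
edges = [0, 50, 100, 150, 200, 250]
train_hist = histogram([50.2] * 1000, edges)   # FeatD in the training frame
eval_hist = histogram([200.43] * 1000, edges)  # FeatD in the eval frame

same = jensen_shannon_divergence(train_hist, train_hist)
shifted = jensen_shannon_divergence(train_hist, eval_hist)
print(same, shifted)
```

Identical distributions give 0 and fully disjoint ones give the maximum, so a small threshold such as 0.01 would flag a shift like the one in the question's FeatD, provided the comparator is actually given a second set of statistics to compare against.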
If you want to check the max and min value range, you need to manually set an inline FloatDomain/IntDomain in the Feature. It's not generated automatically by infer_schema().
Example:
tfdv.get_feature(schema, 'total_actions').int_domain.name = 'total_actions'
tfdv.get_feature(schema, 'total_actions').int_domain.min = 0
tfdv.get_feature(schema, 'total_actions').int_domain.max = 1400
Upvotes: 0
Reputation: 11
We need to use the jensen_shannon_divergence skew comparator for numerical features and infinity_norm for categorical features:
tfdv.get_feature(schema_updated, 'SALES').skew_comparator.jensen_shannon_divergence.threshold = 0.001
# Validate against the same schema object that carries the threshold set above.
skew_anomalies = tfdv.validate_statistics(statistics=new_dataset_stats, schema=schema_updated, serving_statistics=old_dataset_stats)
tfdv.display_anomalies(skew_anomalies)
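For the categorical side, the infinity_norm comparator is based on the L-infinity distance between the value frequencies of the two datasets. A rough pure-Python sketch of that distance, using FeatA from the question, could look like this. The helper names and the mixed example are assumptions for illustration only, and this is not how TFDV computes it internally.

```python
from collections import Counter


def value_frequencies(values):
    """Map each distinct value to its share of the column."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def l_infinity_distance(freq_a, freq_b):
    """Largest absolute difference in frequency over all values seen in either column."""
    keys = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(k, 0.0) - freq_b.get(k, 0.0)) for k in keys)


train_freq = value_frequencies(['A'] * 1000)                # FeatA in the training frame
eval_freq = value_frequencies(['A1'] * 1000)                # FeatA in the eval frame
mixed_freq = value_frequencies(['A'] * 900 + ['A1'] * 100)  # hypothetical partial shift

full_shift = l_infinity_distance(train_freq, eval_freq)
partial_shift = l_infinity_distance(train_freq, mixed_freq)
print(full_shift, partial_shift)
```

This distance needs a finite set of distinct values to compare, which is why it suits categorical features, while a histogram-based measure such as the Jensen-Shannon divergence is the better fit for numerical ones.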
Upvotes: 0
Reputation: 129
Normally, TFDV does not infer domains for numerical values. You have three possible solutions:
1- Change the type of the dataframe column to str, so that it is treated as a Bytes feature.
2- Add an int_domain (float_domain for FeatD) to your features and set your desired min and max.
3- For int features only, set int_domain.is_categorical to True and then use a drift/skew comparator. You will be able to detect new values within the top-k values.
Upvotes: 0