Reputation: 843
Consider the following CSV file (example.csv):
animal,size,weight,category
lion,large,200,mammal
ostrich,large,150,bird
sparrow,small,0.1,bird
whale,large,3000,mammal
bat,small,0.2,mammal
snake,small,1,reptile
condor,medium,12,bird
The goal is to convert all the categorical values into one-hot encodings. The standard way to do this in TensorFlow 2.0 is to use tf.data. Following that example, the code to deal with the dataset above is:
import collections
import tensorflow as tf
# Load the dataset.
dataset = tf.data.experimental.make_csv_dataset(
    'example.csv',
    batch_size=5,
    num_epochs=1,
    shuffle=False)
# Specify the vocabulary for each category.
categories = collections.OrderedDict()
categories['animal'] = ['lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor']
categories['size'] = ['large', 'medium', 'small']
categories['category'] = ['mammal', 'reptile', 'bird']
# Define the categorical feature columns.
categorical_columns = []
for feature, vocab in categories.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    categorical_columns.append(tf.feature_column.indicator_column(cat_col))
# Retrieve the first batch and apply the one-hot encoding to it.
iterator = iter(dataset)
first_batch = next(iterator)
categorical_layer = tf.keras.layers.DenseFeatures(categorical_columns)
print(categorical_layer(first_batch).numpy())
Running the code above, one gets
[[1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0.]
[0. 0. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.]
[0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0.]
[0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1.]]
where it looks like the last two columns, size and category, have been flipped, even though categories is an ordered dictionary and the columns appear in that same order in the dataset itself. It's as if tf.feature_column.categorical_column_with_vocabulary_list() did some unwarranted alphabetical sorting of the columns.
What's the reason for this? Is this really the best way to do one-hot encoding in the spirit of tf.data?
Upvotes: 1
Views: 470
Reputation: 2452
The sorting isn't occurring in tf.feature_column.categorical_column_with_vocabulary_list(). If you print categorical_columns, you will see that the columns are still in the order in which you appended them to the list:
[
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='animal', vocabulary_list=('lion', 'ostrich', 'sparrow', 'whale', 'bat', 'snake', 'condor'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='size', vocabulary_list=('large', 'medium', 'small'), dtype=tf.string, default_value=-1, num_oov_buckets=0)),
IndicatorColumn(categorical_column=VocabularyListCategoricalColumn(key='category', vocabulary_list=('mammal', 'reptile', 'bird'), dtype=tf.string, default_value=-1, num_oov_buckets=0))
]
The sorting occurs in the tf.keras.layers.DenseFeatures object.
In the code, you can see where the sorting occurs (I found this by tracing the class inheritance from tf.keras.layers.DenseFeatures to tensorflow.python.feature_column.dense_features.DenseFeatures to tensorflow.python.feature_column.feature_column_v2._BaseFeaturesLayer, and from there to the _normalize_feature_columns function).
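To make the effect concrete, here is a minimal pure-Python sketch of a name-based sort. It is not TensorFlow's actual code: normalize_order is a hypothetical stand-in, and the "<key>_indicator" names are my assumption about how indicator columns are named. It only illustrates that sorting by name puts category ahead of size no matter what order the columns were passed in.

```python
# Column names in the order the question declares them.
# Assumption: an indicator column is named "<key>_indicator".
declared = ["animal_indicator", "size_indicator", "category_indicator"]


def normalize_order(names):
    """Hypothetical stand-in for a name-based sort of feature columns."""
    return sorted(names)


# The declared order and its reverse both end up in the same sorted order.
print(normalize_order(declared))
# ['animal_indicator', 'category_indicator', 'size_indicator']
print(normalize_order(reversed(declared)))
# ['animal_indicator', 'category_indicator', 'size_indicator']
```

This matches the output in the question: the first 7 values belong to animal, the next 3 to category, and the last 3 to size.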
So why is it sorted? Elsewhere in the file that contains the _normalize_feature_columns function (the function where the columns are sorted), there is a similar sorting function with this comment:
# Sort the columns so the default collection name is deterministic even if the
# user passes columns from an unsorted collection, such as dict.values().
I think this explanation also applies to why the columns are sorted when using the tf.keras.layers.DenseFeatures class. Your columns and data are consistent, but TensorFlow doesn't assume that the input will be, so it sorts the columns to ensure a consistent order.
Upvotes: 2