Reputation: 378
I am trying to implement a Feature Selection component with the following plan in mind: take InputArtifact[Example] as input and produce OutputArtifact[Example] as output (which has the same structure but fewer columns).
I am done with the earlier points of the plan, but I am not able to figure out how to delete the selected columns directly in the TFRecord dataset itself (which I am reading using tf.data.TFRecordDataset(train_uri, compression_type='GZIP')).
Upvotes: 1
Views: 932
Reputation: 378
It took me some time to figure out (with the help of the blog linked by TensorFlow Support in the comments), but here is a workaround!
import tensorflow as tf

# Read the original GZIP-compressed TFRecord file.
split_dataset = tf.data.TFRecordDataset("path_to_original_dataset.gzip", compression_type='GZIP')

# Parse each record, keep only the selected features, and write it to a new file.
with tf.io.TFRecordWriter(path="path_to_new_TFRecord.gzip", options="GZIP") as writer:
    for split_record in split_dataset.as_numpy_iterator():
        example = tf.train.Example()
        example.ParseFromString(split_record)
        updated_example = update_example(selected_features, example)
        writer.write(updated_example.SerializeToString())
Here, update_example is a custom function I wrote that takes the parsed example, processes it, and returns the processed example:
# update example with selected features
def update_example(selected_features, orig_example):
    result = {}
    for key, feature in orig_example.features.feature.items():
        if key in selected_features:
            result[key] = feature
    new_example = tf.train.Example(features=tf.train.Features(feature=result))
    return new_example
Instead of deleting the column (as I wasn't able to find a way to do so), I just created a new example feature-by-feature and returned it!
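To check that the workaround does what I expect, here is a small round-trip sketch. It is only an illustration under my own assumptions: the feature names (age, height, name), the make_example helper, the temporary file names, and the selected_features set are all made up for the test and are not part of the original pipeline. It writes two toy records to a GZIP TFRecord file, filters them with the same update_example logic, and reads the new file back to list the feature names that survived.

```python
import os
import tempfile

import tensorflow as tf


def update_example(selected_features, orig_example):
    """Return a new tf.train.Example holding only the selected features."""
    kept = {
        key: feature
        for key, feature in orig_example.features.feature.items()
        if key in selected_features
    }
    return tf.train.Example(features=tf.train.Features(feature=kept))


def make_example(age, height, name):
    """Build a toy example; the feature names are hypothetical."""
    return tf.train.Example(features=tf.train.Features(feature={
        "age": tf.train.Feature(int64_list=tf.train.Int64List(value=[age])),
        "height": tf.train.Feature(float_list=tf.train.FloatList(value=[height])),
        "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
    }))


selected_features = {"age", "name"}  # hypothetical feature selection result

with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "original.gz")
    dst_path = os.path.join(tmp_dir, "filtered.gz")

    # Write two toy records to a GZIP-compressed TFRecord file.
    with tf.io.TFRecordWriter(src_path, options="GZIP") as writer:
        writer.write(make_example(31, 1.80, b"ann").SerializeToString())
        writer.write(make_example(45, 1.65, b"bob").SerializeToString())

    # Filter record by record, following the workaround above.
    source = tf.data.TFRecordDataset(src_path, compression_type="GZIP")
    with tf.io.TFRecordWriter(dst_path, options="GZIP") as writer:
        for record in source.as_numpy_iterator():
            example = tf.train.Example()
            example.ParseFromString(record)
            filtered = update_example(selected_features, example)
            writer.write(filtered.SerializeToString())

    # Read the new file back and collect the feature names of each record.
    filtered_keys = []
    result = tf.data.TFRecordDataset(dst_path, compression_type="GZIP")
    for record in result.as_numpy_iterator():
        example = tf.train.Example()
        example.ParseFromString(record)
        filtered_keys.append(sorted(example.features.feature.keys()))

print(filtered_keys)
```

Each record in the new file should contain only the selected feature names, so the dropped column never has to be deleted in place.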
Upvotes: 1