Eagle

Reputation: 378

Delete a column from TFRecord Dataset (for feature selection)

I am trying to implement a Feature Selection component with the following plan in mind:

The implementation

I have finished the first and second points, but I cannot figure out how to delete the selected columns directly in the TFRecord dataset itself, which I load with tf.data.TFRecordDataset(train_uri, compression_type='GZIP').

Upvotes: 1

Views: 932

Answers (1)

Eagle

Reputation: 378

It took me some time to figure out (with the help of the blog linked by TensorFlow Support in the comments), but here is a workaround!

import tensorflow as tf

# selected_features (the feature names to keep) and update_example must already be defined
split_dataset = tf.data.TFRecordDataset("path_to_original_dataset.gzip", compression_type='GZIP')
with tf.io.TFRecordWriter(path="path_to_new_TFRecord.gzip", options="GZIP") as writer:
    for split_record in split_dataset.as_numpy_iterator():
        # Parse the serialized bytes back into a tf.train.Example
        example = tf.train.Example()
        example.ParseFromString(split_record)

        updated_example = update_example(selected_features, example)

        writer.write(updated_example.SerializeToString())

Here, update_example is a custom function that takes the parsed example, keeps only the selected features, and returns the processed example:

# Build a new example that contains only the selected features
def update_example(selected_features, orig_example):
    result = {}
    for key, feature in orig_example.features.feature.items():
        if key in selected_features:
            result[key] = feature

    new_example = tf.train.Example(features=tf.train.Features(feature=result))
    return new_example

Instead of deleting the column (as I wasn't able to find a way to do so), I just created a new example feature-by-feature and returned it!
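For what it's worth, deleting in place also seems possible, because the features of a tf.train.Example live in a protobuf map field, and those generally support del like a Python dict. The following is only a minimal sketch of that idea, not the approach used above: drop_features and the feature names "age", "name" and "score" are hypothetical and chosen purely for illustration.

```python
import tensorflow as tf


def drop_features(example, features_to_drop):
    """Remove the named features from a tf.train.Example in place."""
    # Copy the keys first so the map is not mutated while it is being iterated.
    for key in list(example.features.feature.keys()):
        if key in features_to_drop:
            del example.features.feature[key]
    return example


# Hypothetical record with three features, built in memory for the demo.
example = tf.train.Example(features=tf.train.Features(feature={
    "age": tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
    "name": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"eagle"])),
    "score": tf.train.Feature(float_list=tf.train.FloatList(value=[0.5])),
}))

drop_features(example, {"name"})
remaining = sorted(example.features.feature.keys())
print(remaining)
```

Inside the writer loop above, this would stand in for the update_example call, with the set of names to drop being the complement of selected_features. Either way the records still have to be rewritten to a new file, since a TFRecord file cannot be edited in place.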

Upvotes: 1
