Reputation: 6197
Say that I have a TFRecord file, and each row in the TFRecords contains ints that are 0 or positive, padded with -1 so that all the rows are the same size. So something like
0 3 43 223 23 -1 -1 -1
4 12 3 11 435 2 4 -1
9 3 11 32 34 322 9 7
...
How do I randomly select 3 numbers from each of the rows?
The numbers will act as indexes to look up values in an embedding matrix, and those embeddings will then be averaged (basically the word2vec CBOW model).
More specifically, how do I avoid selecting the padding value of -1? -1 is just what I used to pad my rows so that each row is the same size in order to use TFRecords. (If there is a way to use varying-length rows in TFRecords, let me know.)
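For a single padded row, the kind of selection I'm after would look something like this (just a sketch using TF 1.x ops, to illustrate the intent):
import tensorflow as tf

row = tf.constant([0, 3, 43, 223, 23, -1, -1, -1], dtype=tf.int64)
valid = tf.boolean_mask(row, tf.not_equal(row, -1))  # drop the -1 padding
sample = tf.random_shuffle(valid)[:3]                # 3 random non-padding values
But I'd rather avoid the masking entirely if the padding can be avoided in the first place.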
Upvotes: 1
Views: 370
Reputation: 164
I think you're looking for something like tf.VarLenFeature(). More specifically, you do not necessarily have to pad your rows prior to creating the TFRecord file. You can create the tf_example like this:
from tensorflow.python_io import TFRecordWriter
from tensorflow.train import Feature, Features, Example, Int64List

# One variable-length row per example; no -1 padding needed.
tf_example = Example(
    features=Features(
        feature={
            "my_feature": Feature(
                int64_list=Int64List(value=[0, 3, 43, 223, 23])
            )
        }
    )
)

with TFRecordWriter(tfrecord_file_path) as tf_writer:
    tf_writer.write(tf_example.SerializeToString())
Do this for all of your rows; they can vary in length.
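For instance, writing the three rows from the question could look like this (a sketch reusing the imports above; tfrecord_file_path is a placeholder path):
rows = [[0, 3, 43, 223, 23],
        [4, 12, 3, 11, 435, 2, 4],
        [9, 3, 11, 32, 34, 322, 9, 7]]

with TFRecordWriter(tfrecord_file_path) as tf_writer:
    for row in rows:
        tf_example = Example(
            features=Features(
                feature={"my_feature": Feature(int64_list=Int64List(value=row))}
            )
        )
        tf_writer.write(tf_example.SerializeToString())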
You'll parse the tf_examples with something like:
import tensorflow as tf

def parse_tf_example(example):
    feature_spec = {
        "my_feature": tf.VarLenFeature(dtype=tf.int64)
    }
    return tf.parse_example([example], features=feature_spec)
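You can then map this over the serialized records, e.g. with the tf.data API (a sketch, assuming TF 1.x):
dataset = tf.data.TFRecordDataset([tfrecord_file_path])
dataset = dataset.map(parse_tf_example)  # each element holds a tf.SparseTensor per feature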
Now, this will return your features as tf.SparseTensors. If you don't want to deal with that at this stage and would rather carry on using tensor ops as you normally would, you can simply use tf.sparse_tensor_to_dense() and work with dense tensors from there.
The returned dense tensors will be of varying lengths, so you shouldn't have to worry about selecting -1s; there won't be any. The exception is if you convert the sparse tensors to dense in batches, in which case each batch will be padded to the length of its longest tensor, with the padding value set by the default_value parameter.
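For example (a sketch; serialized_example stands for one record read from the file):
parsed = parse_tf_example(serialized_example)
dense = tf.sparse_tensor_to_dense(parsed["my_feature"], default_value=-1)  # -1 only appears if batching pads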
That covers your question about using varying-length rows in TFRecords and getting back varying-length tensors.
With regards to the lookup op, I haven't used it myself, but I think tf.nn.embedding_lookup_sparse() might help you out here. It offers the ability to look up the embeddings directly from the sparse tensor, forgoing the need to convert it to a dense tensor first, and it also has a combiner parameter to specify a reduction op on those embeddings, which in your case would be 'mean'.
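Roughly (a sketch; vocab_size and embedding_dim are placeholders for your model's sizes):
embeddings = tf.get_variable("embeddings", shape=[vocab_size, embedding_dim])
averaged = tf.nn.embedding_lookup_sparse(
    embeddings,
    sp_ids=parsed["my_feature"],  # sparse ids straight from the parser
    sp_weights=None,              # None means all weights are 1.0
    combiner="mean")              # average the looked-up embeddings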
I hope this helps in some way, good luck.
Upvotes: 1