Reputation: 103
What is the best way to handle sparse vectors of size (about) 30 000, where all entries are zero except a single entry with value one (a 1-HOT vector)?
In my dataset I have a sequence of values that I convert to one 1-HOT vector per value. Here is what I currently do:
# Create some queues to read data from .csv files
...
# Parse example(/line) from the data file
example = tf.decode_csv(value, record_defaults=record_defaults)
# example now looks like (e.g) [[5], [1], [4], [38], [571], [9]]
# [5] indicates the length of the sequence
# 1, 4, 38, 571 is the input sequence
# 4, 38, 571, 9 is the target sequence
# Create 1-HOT vectors for each value in the sequence
sequence_length = example[0]
one_hots = example[1:]
one_hots = tf.reshape(one_hots, [-1])
one_hots = tf.one_hot(one_hots, depth=n_classes)
# Grab the first values as the input features and the last values as target
features = one_hots[:-1]
targets = one_hots[1:]
...
# The sequence_length, features and targets are added to a list
# and the list is sent into a batch with tf.train_batch_join(...).
# So now I can get batches and feed into my RNN
...
This works, but I am convinced it could be done more efficiently. I looked at SparseTensor, but I could not figure out how to create SparseTensors from the example tensor returned by tf.decode_csv. I also read somewhere that it is best to parse the data after it has been retrieved as a batch; is this still true?
Here is a pastebin of the full code. Line 32 onwards shows my current way of creating the 1-HOT vectors.
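For reference, the approach above can be sketched in plain NumPy (the indices here are made up, not taken from the actual dataset), which makes the memory cost of the dense 1-HOT rows visible:

```python
import numpy as np

n_classes = 30000  # roughly the vocabulary size mentioned above

# Hypothetical decoded sequence of value indices (without the length marker)
sequence = np.array([1, 4, 38, 571, 9])

# Dense 1-HOT encoding: one 30 000-wide row per value,
# with a single 1.0 at the value's index
one_hots = np.zeros((len(sequence), n_classes))
one_hots[np.arange(len(sequence)), sequence] = 1.0

# Shift by one position: all but the last row are the input features,
# all but the first row are the targets
features = one_hots[:-1]
targets = one_hots[1:]

print(features.shape)  # (4, 30000) -- large, and almost entirely zeros
```

Each row stores 30 000 floats to represent a single integer, which is the inefficiency the question is about.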
Upvotes: 2
Views: 2154
Reputation: 5945
Instead of converting your inputs to sparse 1-HOT vectors, it is preferable to use tf.nn.embedding_lookup, which simply selects the relevant rows of the matrix you would otherwise multiply by. This is equivalent to multiplying the matrix by the 1-HOT vector.
Here is a usage example:
import numpy as np
import tensorflow as tf

embed_dim = 3
vocab_size = 10

# Random embedding matrix: one row per vocabulary entry
E = np.random.rand(vocab_size, embed_dim)
print(E)

embeddings = tf.Variable(E)
examples = tf.Variable(np.array([4, 5, 2, 9]).astype('int32'))

# Selects rows 4, 5, 2 and 9 of the embedding matrix
examples_embedded = tf.nn.embedding_lookup(embeddings, examples)

s = tf.InteractiveSession()
s.run(tf.global_variables_initializer())
print(examples_embedded.eval())
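To see the equivalence claimed above without any TensorFlow machinery, here is a small NumPy check (the sizes and indices are illustrative): selecting rows by index gives the same result as multiplying a 1-HOT matrix by the embedding matrix.

```python
import numpy as np

vocab_size, embed_dim = 10, 3
E = np.random.rand(vocab_size, embed_dim)
idx = np.array([4, 5, 2, 9])

# Row selection -- what embedding_lookup does
looked_up = E[idx]

# Explicit 1-HOT construction followed by a matrix multiply
one_hot = np.zeros((len(idx), vocab_size))
one_hot[np.arange(len(idx)), idx] = 1.0
multiplied = one_hot @ E

print(np.allclose(looked_up, multiplied))  # True
```

The lookup skips both building the 30 000-wide vectors and the multiplication itself, which is why it scales to large vocabularies.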
Also see this example in the im2txt project for how to feed this kind of data to an RNN (the line saying seq_embeddings = tf.nn.embedding_lookup(embedding_map, self.input_seqs)).
Upvotes: 1