Ole Steinar Skrede

Reputation: 103

How to handle very sparse vectors in Tensorflow

What is the best way to handle sparse vectors of size (about) 30 000, where all indexes are zero except one index with value one (1-HOT vector)?

In my dataset I have a sequence of values that I convert to one 1-HOT vector per value. Here is what I currently do:

# Create some queues to read data from .csv files
...
# Parse example(/line) from the data file
example = tf.decode_csv(value, record_defaults=record_defaults)

# example now looks like (e.g) [[5], [1], [4], [38], [571], [9]]
# [5] indicates the length of the sequence
# 1, 4, 38, 571 is the input sequence
# 4, 38, 571, 9 is the target sequence
# Create 1-HOT vectors for each value in the sequence
sequence_length = example[0]
one_hots = example[1:]
one_hots = tf.reshape(one_hots, [-1])
one_hots = tf.one_hot(one_hots, depth=n_classes)

# Grab the first values as the input features and the last values as target
features = one_hots[:-1]
targets = one_hots[1:]

...
# The sequence_length, features and targets are added to a list
# and the list is sent into a batch with tf.train_batch_join(...).
# So now I can get batches and feed into my RNN
...

This works, but I am convinced it could be done more efficiently. I looked at SparseTensor, but I could not figure out how to create a SparseTensor from the example tensor returned by tf.decode_csv. I also read somewhere that it is best to parse the data after it has been retrieved as a batch; is that still true?

Here is a pastebin of the full code. From line 32 is my current way of creating 1-HOT vectors.

Upvotes: 2

Views: 2154

Answers (1)

Yuval Atzmon

Reputation: 5945

Instead of converting your inputs to sparse 1-hot vectors, it is preferable to use tf.nn.embedding_lookup, which simply selects the relevant rows of the matrix you would otherwise multiply by. This is equivalent to multiplying the matrix by the 1-hot vector.
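To see why the two are equivalent, here is a minimal NumPy sketch (the matrix E, the index i, and all sizes are made-up illustrations): a 1-hot row vector times the embedding matrix picks out exactly one row of it.

```python
import numpy as np

vocab_size, embed_dim = 5, 3
# A small, deterministic embedding matrix for illustration
E = np.arange(vocab_size * embed_dim, dtype=float).reshape(vocab_size, embed_dim)

i = 2                          # the single "hot" index
one_hot = np.zeros(vocab_size)
one_hot[i] = 1.0

# Multiplying by the 1-hot vector zeroes out every row except row i...
via_matmul = one_hot @ E
# ...which is exactly what a direct row lookup returns
via_lookup = E[i]

assert np.array_equal(via_matmul, via_lookup)
```

The lookup skips the (mostly wasted) multiply-by-zero work, which is the whole point for a ~30 000-entry vocabulary.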

Here is a usage example

import numpy as np
import tensorflow as tf

embed_dim = 3
vocab_size = 10
E = np.random.rand(vocab_size, embed_dim)
print(E)

embeddings = tf.Variable(E)
examples = tf.Variable(np.array([4, 5, 2, 9]).astype('int32'))

# Select one embedding row per example index
examples_embedded = tf.nn.embedding_lookup(embeddings, examples)

s = tf.InteractiveSession()
s.run(tf.initialize_all_variables())
print(examples_embedded.eval())

Also see this example in the im2txt project for how to feed this kind of data into an RNN (the line saying seq_embeddings = tf.nn.embedding_lookup(embedding_map, self.input_seqs)).
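For sequence inputs like yours, tf.nn.embedding_lookup accepts a whole [batch, seq_len] matrix of ids at once and returns a [batch, seq_len, embed_dim] tensor, so no per-value 1-hot conversion is needed. A NumPy sketch of that shape behavior (all names and sizes here are made up; NumPy fancy indexing mirrors what the lookup does):

```python
import numpy as np

vocab_size, embed_dim = 10, 4
embedding_map = np.random.rand(vocab_size, embed_dim)

# A batch of 2 sequences, each 3 ids long (made-up ids)
input_seqs = np.array([[1, 4, 8],
                       [4, 8, 9]])

# tf.nn.embedding_lookup(embedding_map, input_seqs) behaves like
# fancy indexing: one embedding row per id, shapes preserved
seq_embeddings = embedding_map[input_seqs]

print(seq_embeddings.shape)  # (2, 3, 4): [batch, seq_len, embed_dim]
```

The resulting [batch, seq_len, embed_dim] tensor can be fed straight into an RNN as its input sequence.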

Upvotes: 1
