I am attempting to split a string into words, then split each resulting word into a list of characters. Ultimately, I have a file with one example per line, and I would like each line split into words which are in turn split into characters.
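For concreteness, here is the transformation I'm after sketched in plain Python (no TensorFlow, and the example string is made up):

```python
# Plain-Python sketch of the desired jagged structure:
# split a line into words, then each word into its characters.
line = 'This is the string'
jagged = [list(word) for word in line.split(' ')]
# each inner list is one word's characters; rows have different lengths
```

The question is how to express this jagged result as a SparseTensor in the graph.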
sess = tf.Session()
string = tf.constant(['This is the string I would like to split.'], dtype=tf.string)
words = tf.string_split(string)
print words.eval(session=sess)
Results in
SparseTensorValue(indices=array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[0, 5],
[0, 6],
[0, 7],
[0, 8]]), values=array(['This', 'is', 'the', 'string', 'I', 'would', 'like', 'to',
'split.'], dtype=object), dense_shape=array([1, 9]))
Now, I would like a SparseTensor representing the jagged array, where each row is a word and the columns are its characters. I've tried something like:
def split_word(word):
    word = tf.expand_dims(word, axis=0)
    word = tf.string_split(word, delimiter='')
    return word.values

split_words = tf.map_fn(split_word, words.values)
But that does not work, because map_fn builds a TensorArray, and the shapes have to match. Is there a clean way to accomplish this?
Upvotes: 1
Views: 1495
I've ended up using a tf.while_loop within a Dataset.map. The following is a working example that reads a file with one example per line. It's not very elegant, but it accomplishes the goal.
import tensorflow as tf
def split_line(line):
    # Split the line into words
    line = tf.expand_dims(line, axis=0)
    line = tf.string_split(line, delimiter=' ')

    # Loop over the resulting words, split them into characters,
    # and stack them back together
    def body(index, words):
        next_word = tf.sparse_slice(line, start=tf.to_int64(index), size=[1, 1]).values
        next_word = tf.string_split(next_word, delimiter='')
        words = tf.sparse_concat(axis=0, sp_inputs=[words, next_word], expand_nonconcat_dim=True)
        return index + [0, 1], words

    def condition(index, words):
        return tf.less(index[1], tf.size(line))

    i0 = tf.constant([0, 1])
    first_word = tf.string_split(tf.sparse_slice(line, [0, 0], [1, 1]).values, delimiter='')
    _, line = tf.while_loop(condition, body, loop_vars=[i0, first_word], back_prop=False)

    # Convert to dense
    return tf.sparse_tensor_to_dense(line, default_value=' ')
dataset = tf.data.TextLineDataset(['./example.txt'])
dataset = dataset.map(split_line)
iterator = dataset.make_initializable_iterator()
parsed_line = iterator.get_next()
sess = tf.Session()
sess.run(iterator.initializer)
for example in range(3):
    print sess.run(parsed_line)
    print
Results in
[['T' 'h' 'i' 's' ' ']
['i' 's' ' ' ' ' ' ']
['t' 'h' 'e' ' ' ' ']
['f' 'i' 'r' 's' 't']
['l' 'i' 'n' 'e' '.']]
[['A' ' ' ' ' ' ' ' ' ' ' ' ' ' ']
['s' 'e' 'c' 'o' 'n' 'd' ' ' ' ']
['e' 'x' 'a' 'm' 'p' 'l' 'e' '.']]
[['T' 'h' 'i' 'r' 'd' '.']]
Upvotes: 1
This sounds like preprocessing; you will be much better off using the Dataset preprocessing pipeline:
https://www.tensorflow.org/programmers_guide/datasets
You'll start by importing the raw strings. Then use a Dataset.map(...) to map each string to a variable-length array of word tensors. I did this a few days ago and posted an example on this question:
In Tensorflow's Dataset API how do you map one element into multiple elements?
You'll want to follow that with a Dataset.flat_map(...) to flatten the variable-length rows of word tokens into individual samples.
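A minimal sketch of that map/flat_map pattern (written against the newer tf.strings.split name from TF 2.x; in TF 1.4 the equivalent call is tf.string_split, and the input lines here are made up):

```python
import tensorflow as tf

# Two made-up input lines standing in for the raw strings.
lines = tf.data.Dataset.from_tensor_slices(['first line here', 'second line'])

# flat_map turns each line into a sub-dataset of its words, then
# flattens those sub-datasets into a single stream of word samples.
words = lines.flat_map(
    lambda line: tf.data.Dataset.from_tensor_slices(tf.strings.split(line)))
```

Iterating `words` then yields one scalar string tensor per word ('first', 'line', 'here', 'second', 'line'), each of which can be split further into characters downstream.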
The Dataset pipeline is new in TF 1.4 and appears to be the way pipelining will be handled in TensorFlow going forward, so it'll be worth the effort to learn.
This question might also be useful to you; I ran into it while doing something similar to what you are doing. Don't start with it if you're just getting started with the TF pipeline, but you might find it useful along the way.
Using tensorflow's Dataset pipeline, how do I *name* the results of a `map` operation?
Upvotes: 0