Michael

Reputation: 31

How do I split the result of a split operation?

I am attempting to split a string into words, then split each resulting word into a list of characters. Ultimately, I have a file with one example per line, and I would like each line split into words which are in turn split into characters.

sess = tf.Session()

string = tf.constant(['This is the string I would like to split.'], dtype=tf.string)
words = tf.string_split(string)

print(words.eval(session=sess))

Results in

SparseTensorValue(indices=array([[0, 0],
   [0, 1],
   [0, 2],
   [0, 3],
   [0, 4],
   [0, 5],
   [0, 6],
   [0, 7],
   [0, 8]]), values=array(['This', 'is', 'the', 'string', 'I', 'would', 'like', 'to',
   'split.'], dtype=object), dense_shape=array([1, 9]))

Now, I would like a SparseTensor representing the jagged array, where each row is a word and the columns are its characters. I've tried something like:

def split_word(word):
    word = tf.expand_dims(word, axis=0)
    word = tf.string_split(word, delimiter='')
    return word.values 

split_words = tf.map_fn(split_word, words.values)

But that does not work, because map_fn builds a TensorArray, and the shapes have to match. Is there a clean way to accomplish this?

Upvotes: 1

Views: 1495

Answers (2)

Michael

Reputation: 31

I've ended up using a tf.while_loop within a Dataset.map. The following is a working example that reads a file with one example per line. It's not very elegant, but it accomplishes the goal.

import tensorflow as tf

def split_line(line):
    # Split the line into words
    line = tf.expand_dims(line, axis=0)
    line = tf.string_split(line, delimiter=' ')

    # Loop over the resulting words, split them into characters, and stack them back together
    def body(index, words):                                                         
        next_word = tf.sparse_slice(line, start=tf.to_int64(index), size=[1, 1]).values
        next_word = tf.string_split(next_word, delimiter='')
        words = tf.sparse_concat(axis=0, sp_inputs=[words, next_word], expand_nonconcat_dim=True)
        return index+[0, 1], words
    def condition(index, words):           
        return tf.less(index[1], tf.size(line))

    i0 = tf.constant([0,1]) 
    first_word = tf.string_split(tf.sparse_slice(line, [0,0], [1, 1]).values, delimiter='')
    _, line = tf.while_loop(condition, body, loop_vars=[i0, first_word], back_prop=False) 

    # Convert to dense              
    return tf.sparse_tensor_to_dense(line, default_value=' ')

dataset = tf.data.TextLineDataset(['./example.txt'])
dataset = dataset.map(split_line)
iterator = dataset.make_initializable_iterator()
parsed_line = iterator.get_next()

sess = tf.Session()
sess.run(iterator.initializer)
for example in range(3):       
    print(sess.run(parsed_line))
    print()

Results in

[['T' 'h' 'i' 's' ' ']
 ['i' 's' ' ' ' ' ' ']
 ['t' 'h' 'e' ' ' ' ']
 ['f' 'i' 'r' 's' 't']
 ['l' 'i' 'n' 'e' '.']]

[['A' ' ' ' ' ' ' ' ' ' ' ' ' ' ']
 ['s' 'e' 'c' 'o' 'n' 'd' ' ' ' ']
 ['e' 'x' 'a' 'm' 'p' 'l' 'e' '.']]

[['T' 'h' 'i' 'r' 'd' '.']]
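The while-loop above can be hard to trace, so here is the same jagged-to-dense padding expressed in plain Python (no TensorFlow required; `pad_chars` is an illustrative helper name, not part of any API). Each word becomes a row of characters, padded with spaces to the length of the longest word, which is exactly what `tf.sparse_tensor_to_dense(line, default_value=' ')` produces in the output above:

```python
# Plain-Python sketch of the padding that sparse_tensor_to_dense performs.
# pad_chars is a hypothetical helper name, not TensorFlow API.

def pad_chars(line, pad=' '):
    """Split a line into words, then each word into characters,
    padding every row with `pad` to the length of the longest word."""
    words = line.split(' ')
    width = max(len(w) for w in words)
    return [list(w) + [pad] * (width - len(w)) for w in words]

rows = pad_chars('This is the first line.')
# rows[0] is ['T', 'h', 'i', 's', ' '], matching the dense output above.
```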

Upvotes: 1

David Parks

Reputation: 32071

This sounds like preprocessing; you will be much better off using the Dataset preprocessing pipeline.

https://www.tensorflow.org/programmers_guide/datasets

You'll start by importing the raw strings. Then use a tf.data.Dataset.map(...) to map each string to a variable-length array of word tensors. I just did this a few days ago and posted an example on this question:

In Tensorflow's Dataset API how do you map one element into multiple elements?

You'll want to follow that with tf.data.Dataset.flat_map(...) to flatten the variable-length row of word tokens into individual samples.
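The map-then-flatten step can be sketched in plain Python (a hedged analogy, not TensorFlow API): `map` turns each line into a variable-length list of word tokens, and `flat_map` concatenates those lists so each token becomes its own sample.

```python
# Plain-Python analogue of Dataset.map followed by Dataset.flat_map.
lines = ['This is the first line.', 'A second example.']

# map: each line -> a variable-length list of word tokens
tokenized = [line.split(' ') for line in lines]

# flat_map: flatten the nested lists into one stream of samples
samples = [word for words in tokenized for word in words]
# samples now holds one word per element: 'This', 'is', 'the', ...
```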

The Dataset pipeline is new in TF 1.4 and appears to be the way pipelining will be handled in TensorFlow going forward, so it'll be worth the effort to learn.

This question might also be useful to you; I ran into it while doing something similar to what you are doing. Don't start with it if you're just getting started with the TF pipeline, but you might find it useful along the way.

Using tensorflow's Dataset pipeline, how do I *name* the results of a `map` operation?

Upvotes: 0
