krishnab

Reputation: 10060

Tensorflow: convert a `tf.data.Dataset` iterator to a Tensor

I have a dataset that comes in as a tf.data.Dataset from the new tensorflow_datasets (tfds) module. Of course the tf.data.Dataset is an iterator over examples, but I need to actually convert this iterator into a full tensor containing all of the data loaded into memory. I am working with textual data, and in order to extract the vocabulary of the corpus for tokenization, I actually need the entire corpus of text at once.

I can of course write a loop to do this (I sketch one below), but I was wondering if there is a more vectorized or faster way to accomplish the same task. Thanks.

I can at least provide the beginnings of the code. Note that I am using TensorFlow 2.0 alpha to get ready for the changeover:

import tensorflow_datasets as tfds

# Download the data
imdb_builder = tfds.builder('imdb_reviews')
imdb_builder.download_and_prepare()

# Setup training test split
imdb_train = imdb_builder.as_dataset(split=tfds.Split.TRAIN)
imdb_test = imdb_builder.as_dataset(split=tfds.Split.TEST)

# Look at the specs on the dataset if you wish
# print(imdb_builder.info)

To look at a single example, observe that the data is un-tokenized:

a, = imdb_train.take(1)
print(a['text'])

tf.Tensor(b"As a lifelong fan of Dickens, I have ...", shape=(), dtype=string)

This is where I got stuck. When I tried to create an iterator over this dataset, I got an error:

imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

<ipython-input-35-1bf70c474a05> in <module>()
----> 1 imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()

AttributeError: 'RepeatDataset' object has no attribute 'make_one_shot_iterator'

Upvotes: 2

Views: 5325

Answers (1)

Szymon Maszke

Reputation: 24691

1. Data Loading

Using tfds.load is simpler and more compact:

import tensorflow_datasets as tfds

train = tfds.load("imdb_reviews", as_supervised=True, split=tfds.Split.TRAIN)
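With as_supervised=True each element comes back as a (text, label) pair rather than a dict, which a quick check confirms (a minimal sketch):

text, label = next(iter(train))
print(text.dtype)      # <dtype: 'string'>
print(label.numpy())   # 0 or 1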

2. Vocabulary saver

Pretty simple; depending on your use case, you may want to start indexing from zero.

class Tokenizer:
    def __init__(self):
        self.vocab = {}
        self._counter: int = 1  # start at 1; change to 0 for zero-based indexing
        self.tokenizer = tfds.features.text.Tokenizer()

    def __call__(self, text):
        # Haven't found anything working directly on tf.Tensor, oh sweet irony,
        # so pull the raw bytes out with .numpy() before tokenizing
        tokens = self.tokenizer.tokenize(text.numpy())
        for token in tokens:
            if token not in self.vocab:
                self.vocab[token] = self._counter
                self._counter += 1

TBH it's a shame there is no tokenizer-like utility for plain tensors and I need to convert them like that, but oh well, it's still in the alpha stage.

3. Tokenize your data

Since TF 2.0 and its eager execution, you can skip one_shot_iterator and other such workarounds and iterate comfortably with a plain Python loop:

tokenizer = Tokenizer()

for text, _ in train:
    tokenizer(text)

Important: You don't have to load everything into memory, as it's an iterator. Though you may encounter memory problems with vocab for really large corpora.
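Once the vocab is built, one way to go from raw text to actual tensors of token ids is sketched below (my addition, not part of the tokenizer above; the fallback to 0 for unknown tokens and the use of tf.ragged.constant are assumptions):

import tensorflow as tf

# Sketch: encode each review into a list of token ids using the built vocab,
# then pack the variable-length rows into a single ragged tensor.
def encode(text):
    tokens = tokenizer.tokenizer.tokenize(text.numpy())
    return [tokenizer.vocab.get(token, 0) for token in tokens]

encoded = [encode(text) for text, _ in train.take(2)]
ids = tf.ragged.constant(encoded)  # one row per review, variable length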

4. Results

Printing items and their indices:

print(list(tokenizer.vocab.keys())[:10])
print(list(tokenizer.vocab.values())[:10])

Gives us:

['This', 'was', 'soul', 'provoking', 'I', 'am', 'an', 'Iranian', 'and', 'living']
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
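And if you ever need to map ids back to tokens, inverting the dict is a one-liner (a minimal sketch):

# Sketch: invert the vocab for id -> token lookups.
inv_vocab = {index: token for token, index in tokenizer.vocab.items()}
print(inv_vocab[1])  # 'This'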

Upvotes: 2
