Reputation: 10060
I have a dataset that comes in as a tf.data.Dataset from the new tensorflow_datasets module. Of course, a tf.data.Dataset is an iterator over examples, but I actually need to convert this iterator into a full tensor containing all of the data loaded into memory. I am working with textual data, and in order to extract the vocabulary of the corpus for tokenization, I need the entire corpus of text at once.
I can of course write a loop to do this, but I was wondering if there is a more vectorized or faster way to implement the same task. Thanks.
I can at least provide the beginnings of the code. Note that I am using TensorFlow 2.0 alpha to get ready for the changeover:
import tensorflow_datasets as tfds
# Download the data
imdb_builder = tfds.builder('imdb_reviews')
imdb_builder.download_and_prepare()
# Setup training test split
imdb_train = imdb_builder.as_dataset(split=tfds.Split.TRAIN)
imdb_test = imdb_builder.as_dataset(split=tfds.Split.TEST)
# Look at the specs on the dataset if you wish
# print(imdb_builder.info)
Here is how to look at a single example. Observe that the data is not yet tokenized:
a, = imdb_train.take(1)
print(a['text'])
tf.Tensor(b"As a lifelong fan of Dickens, I have ...", shape=(), dtype=string)
This is where I got stuck. Note that when trying to create the iterator over this dataset I obtained an error:
imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-35-1bf70c474a05> in <module>()
----> 1 imdb_train = imdb_train.batch(10).repeat(1).make_one_shot_iterator()
AttributeError: 'RepeatDataset' object has no attribute 'make_one_shot_iterator'
Upvotes: 2
Views: 5325
Reputation: 24691
Using tfds.load is simpler and more compact:
import tensorflow_datasets as tfds
train = tfds.load("imdb_reviews", as_supervised=True, split=tfds.Split.TRAIN)
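With as_supervised=True, the dataset yields (text, label) tuples rather than feature dicts, which is why the loop further down unpacks two values. A quick sanity check, just a sketch under that assumption:
# Each element is a (text, label) tuple when as_supervised=True
for text, label in train.take(1):
    print(text.numpy())   # raw review bytes
    print(label.numpy())  # integer sentiment label (0 = negative, 1 = positive)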
The tokenizer itself is pretty simple; note that it starts indexing from one, so you may want to start from zero instead.
class Tokenizer:
    def __init__(self):
        self.vocab = {}  # token -> integer id
        self._counter: int = 1
        self.tokenizer = tfds.features.text.Tokenizer()

    def __call__(self, text):
        # Haven't found anything working with tf.Tensor, oh sweet irony
        tokens = self.tokenizer.tokenize(text.numpy())
        for token in tokens:
            if token not in self.vocab:
                self.vocab[token] = self._counter
                self._counter += 1
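Once the vocabulary is built (the loop below does that), you could also map new text to integer ids with a small helper. The encode function here is my own hypothetical sketch, not part of tfds; since ids start at 1, 0 stays free for unknown tokens or padding:
# Hypothetical helper (not from tfds): reuse the Tokenizer's vocab
# to turn a text tensor into integer ids; unknown tokens map to 0.
def encode(tokenizer, text):
    tokens = tokenizer.tokenizer.tokenize(text.numpy())
    return [tokenizer.vocab.get(token, 0) for token in tokens]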
TBH it's a shame there is no tokenizer-like utility for plain tensors and I need to convert them like that, but oh well, it's still in the alpha stage.
With TF 2.0 and its eager mode you can forget about one_shot_iterator and other strange ideas, and iterate comfortably using a plain loop:
tokenizer = Tokenizer()
for text, _ in train:
tokenizer(text)
Important: you don't have to load everything into memory, as it's an iterator. Though you may encounter memory problems with vocab for really large corpora.
Printing items and their indices:
print(list(tokenizer.vocab.keys())[:10])
print(list(tokenizer.vocab.values())[:10])
Gives us:
['This', 'was', 'soul', 'provoking', 'I', 'am', 'an', 'Iranian', 'and', 'living']
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
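To tie this back to the original goal of tokenizing the corpus, the hypothetical encode helper sketched above can now turn a review into ids:
# Encode the first training review with the vocab built above
for text, _ in train.take(1):
    print(encode(tokenizer, text)[:10])  # first ten token ids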
Upvotes: 2