Reputation: 41
I followed this tutorial http://www.programmersought.com/article/2609385756/
to create a TabularDataset with data that already tokenized and converted to ids and I do not want to use vocab or build vocab because the data is numerical
so I defined my field variable as:
myField = Field(tokenize= x_tokenize, use_vocab=False, sequential=True)
train,val, test = data.TabularDataset.splits(path='./', train=train_path, validation=valid_path, test=test_path ,format='csv', fields=data_fields, skip_header=True)
train output:
print(vars(train[0])['src'])
#output this [101, 3177, 3702, 11293, 1116, 102]
and I used a BucketIterator:
train_iter= BucketIterator(train,
batch_size=BATCH_SIZE,
device = DEVICE,
sort_key=lambda x: (len(x.src), len(x.trg)),
train=True,
batch_size_fn=batch_size_fn,
repeat=False)
when I run this code:
batch = next(iter(train_iter))
I got TypeError: an integer is required (got type list)
TypeError Traceback (most recent call last) in () ----> 1 batch = next(iter(train_iter))
3 frames /usr/local/lib/python3.6/dist-packages/torchtext/data/iterator.py in iter(self) 155 else: 156 minibatch.sort(key=self.sort_key, reverse=True) --> 157 yield Batch(minibatch, self.dataset, self.device) 158 if not self.repeat: 159 return
/usr/local/lib/python3.6/dist-packages/torchtext/data/batch.py in init(self, data, dataset, device) 32 if field is not None: 33 batch = [getattr(x, name) for x in data] ---> 34 setattr(self, name, field.process(batch, device=device)) 35 36 @classmethod
/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py in process(self, batch, device) 199 """ 200 padded = self.pad(batch) --> 201 tensor = self.numericalize(padded, device=device) 202 return tensor 203
/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py in numericalize(self, arr, device) 321 arr = self.postprocessing(arr, None) 322 --> 323 var = torch.tensor(arr, dtype=self.dtype, device=device) 324 325 if self.sequential and not self.batch_first:
TypeError: an integer is required (got type list)
Upvotes: 1
Views: 1065
Reputation: 51
You have to provide the pad_token while declaring the Field.
Change this
myField = Field(tokenize= x_tokenize, use_vocab=False, sequential=True)
to
myField = Field(tokenize= x_tokenize, use_vocab=False, sequential=True, pad_token=0)
Upvotes: 0