How to adjust the batch data by the amount of labels in PyTorch

Question

I have built (n-gram, doc_id) pairs for document classification:

import torch
from torch.utils.data import DataLoader, TensorDataset

def create_dataset(tok_docs, vocab, n):
  n_grams = []
  document_ids = []
  for doc_id, doc in enumerate(tok_docs):
    tokens = doc[0]
    # slide a window of length n; stopping at len - n + 1 keeps every n-gram exactly n tokens long
    for start in range(len(tokens) - n + 1):
      n_grams.append(tokens[start:start + n])
      document_ids.append(doc_id)
  return n_grams, document_ids

def create_pytorch_datasets(n_grams, doc_ids):
  n_grams_tensor = torch.tensor(n_grams)
  doc_ids_tensor = torch.tensor(doc_ids)
  full_dataset = TensorDataset(n_grams_tensor, doc_ids_tensor)
  return full_dataset

create_dataset returns a pair (n_grams, document_ids), as shown below:

n_grams, doc_ids = create_dataset( ... )
train_data = create_pytorch_datasets(n_grams, doc_ids)
>>> train_data[0:100]
(tensor([[2076, 517, 54, 3647, 1182, 7086],
         [517, 54, 3647, 1182, 7086, 1149],
         ...
         ]),
 tensor([0, 0, 0, 0, 0, ..., 3, 3, 3]))

train_loader = DataLoader(train_data, batch_size = batch_size, shuffle = True)

The first tensor holds the n-grams and the second holds the corresponding doc_ids.

But the amount of training data per label depends on the length of each document.

If one document is very long, the training data will contain many pairs carrying its label.

I think this can cause overfitting, because the classifier will tend to assign inputs to the long documents.

So I want to draw input batches so that the labels (doc_ids) follow a uniform distribution. How can I change the code above to do that?

P.S. Given train_data like the following, I want each pair to be drawn into a batch with the probability shown:

  n-grams        doc_ids
([1, 2, 3, 4],      1)       ====> 0.33
([1, 3, 5, 7],      2)       ====> 0.33
([2, 3, 4, 5],      3)       ====> 0.33 * 0.25
([3, 5, 2, 5],      3)       ====> 0.33 * 0.25
([6, 3, 4, 5],      3)       ====> 0.33 * 0.25
([2, 3, 1, 5],      3)       ====> 0.33 * 0.25

Victor Zuanazzi · Accepted Answer

In PyTorch you can pass a sampler or a batch_sampler to the DataLoader to change how data points are sampled.

docs on the dataloader: https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler

documentation on the sampler: https://pytorch.org/docs/stable/data.html#torch.utils.data.Sampler

For instance, you can use WeightedRandomSampler to assign a weight to every data point. The weight can be the inverse of the document length, so that n-grams from long documents are drawn less often.
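As a minimal sketch of that weighting, assuming the doc_ids tensor from the question is already available, the per-sample weights can also be derived from the label counts with torch.bincount, so that all n-grams of one document together carry the same total probability as those of any other document. The helper name make_balanced_weights is hypothetical, and the toy doc_ids mirror the P.S. example in the question with labels renumbered from 0.

```python
import torch

def make_balanced_weights(doc_ids: torch.Tensor) -> torch.Tensor:
    """Return one weight per sample: 1 / (number of samples sharing its label)."""
    counts = torch.bincount(doc_ids)        # number of samples per document id
    return 1.0 / counts[doc_ids].double()   # look up each sample's own label count

# Toy labels: documents 0 and 1 contribute one n-gram each, document 2 contributes four.
doc_ids = torch.tensor([0, 1, 2, 2, 2, 2])
weights = make_balanced_weights(doc_ids)

# Normalised sampling probability of each individual sample.
probs = weights / weights.sum()
```

WeightedRandomSampler does not require the weights to sum to one, so the normalisation into probs is only there to compare against the probabilities listed in the question.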

I would make the following modifications in the code:

def create_dataset(tok_docs, vocab, n):
  n_grams = []
  document_ids = []
  weights = []  # << list of weights for sampling
  for doc_id, doc in enumerate(tok_docs):
    tokens = doc[0]
    doc_n_grams = [tokens[start:start + n] for start in range(len(tokens) - n + 1)]
    for n_gram in doc_n_grams:
      n_grams.append(n_gram)
      document_ids.append(doc_id)
      weights.append(1 / len(doc_n_grams))  # << n-grams of long documents are sampled less often; each document's weights sum to 1
  return n_grams, document_ids, weights

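from torch.utils.data import DataLoader, WeightedRandomSampler

# num_samples is the number of draws per epoch; a value of 1 would yield a single sample per epoch
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)  # << create the sampler

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=False, sampler=sampler)  # << pass the sampler to the dataloader; shuffle must stay False when a sampler is given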
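To check that the sampler behaves as intended, one option is to run one epoch and count how often each label appears. The following is a rough sketch under stated assumptions, not part of the original answer: the data is a toy set with made-up sizes (one document with 900 n-grams, two with 50 each), the n-grams are random placeholders, and num_samples is set to the dataset size so that one epoch draws as many samples as the dataset holds.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)  # fixed seed so the draw is repeatable

# Toy, imbalanced data: document 0 has 900 n-grams, documents 1 and 2 have 50 each.
doc_ids = torch.cat([torch.zeros(900), torch.ones(50), torch.full((50,), 2.0)]).long()
n_grams = torch.randint(0, 100, (len(doc_ids), 4))  # placeholder n-grams of length 4
train_data = TensorDataset(n_grams, doc_ids)

# Weight each sample by the inverse of its document's sample count.
weights = 1.0 / torch.bincount(doc_ids)[doc_ids].double()
sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
train_loader = DataLoader(train_data, batch_size=100, sampler=sampler)

# Count the labels seen over one epoch of batches.
label_counts = torch.zeros(3, dtype=torch.long)
for _, batch_ids in train_loader:
    label_counts += torch.bincount(batch_ids, minlength=3)
```

With these weights each of the three labels should account for roughly a third of the draws, even though document 0 holds 90% of the samples. Because sampling is done with replacement, n-grams from the short documents are repeated within an epoch and some n-grams from the long document are not seen at all.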
