antggt

Reputation: 21

Data augmentation for text classification

What is the current state-of-the-art data augmentation technique for text classification?

I did some research online on how I can extend my training set with data transformations, the way we do in image classification. I found some interesting ideas such as:

But I found nothing about using pre-trained word vector representation models such as word2vec. Is there a reason?

Data augmentation using word2vec might help the model get more data based on external information. For instance, randomly replacing a token in a toxic comment with its closest token in a pre-trained vector space trained specifically on external online comments.
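A minimal sketch of what I have in mind, assuming the gensim library and its downloadable GloVe vectors as a stand-in for a word2vec model trained on external comments (the vector set and replacement probability here are illustrative only):

```python
import random

import gensim.downloader

# Stand-in for a word2vec model trained on external online comments;
# any gensim KeyedVectors object would work the same way.
vectors = gensim.downloader.load("glove-wiki-gigaword-100")

def augment(sentence, replace_prob=0.2):
    """Randomly swap tokens for their nearest neighbor in the vector space."""
    out = []
    for token in sentence.lower().split():
        if token in vectors and random.random() < replace_prob:
            # most_similar returns (word, cosine similarity) pairs,
            # ranked by similarity; take the closest neighbor.
            neighbor, _ = vectors.most_similar(token, topn=1)[0]
            out.append(neighbor)
        else:
            out.append(token)
    return " ".join(out)

print(augment("you are such a terrible person"))
```

Each call produces a slightly different variant of the input sentence, so running it several times per training example would grow the training set.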

Is this a good method, or am I missing some important drawbacks of this technique?

Upvotes: 1

Views: 1095

Answers (1)

greeness

Reputation: 16114

Your idea of using word2vec embeddings usually helps. However, word2vec is a context-free embedding. To go one step further, the state of the art (SOTA) as of today (2019-02) is to use a language model trained on a large corpus of text and fine-tune your own classifier on top of it with your own training data.

The two SOTA models are:

The data augmentation methods you mentioned might also help (depending on your domain and the number of training examples you have). Some of them are actually used in language model training; for example, one of BERT's pre-training tasks randomly masks out words in a sentence. If I were you, I would first adopt a pre-trained model and fine-tune a classifier on your current training data. Taking that as a baseline, you could then try each data augmentation method you like and see whether it really helps.
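A minimal sketch of that baseline, assuming the Hugging Face transformers library and a binary (e.g. toxic vs. non-toxic) task; the model name, toy data, and hyperparameters are illustrative only:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pre-trained BERT with a freshly initialized classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Hypothetical toy data; replace with your own labeled comments.
texts = ["thanks, that was really helpful", "you are an idiot"]
labels = torch.tensor([0, 1])  # 0 = non-toxic, 1 = toxic

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Score this baseline on a held-out set first; then re-train on each augmented version of your data and compare against it.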

Upvotes: 1
