Data augmentation for text classification

Question

What is the current state-of-the-art data augmentation technique for text classification?

I did some research online on how to extend my training set with data transformations, the same way we do for image classification. I found some interesting ideas, such as:

Synonym Replacement: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
Random Insertion: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random place in the sentence. Do this n times.
Random Swap: Randomly choose two words in the sentence and swap their positions. Do this n times.
Random Deletion: Randomly remove each word in the sentence with probability p.
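
To make the four operations above concrete, here is a minimal sketch of how they could be written. The stop-word set, the synonym table, and all function names are hypothetical toy stand-ins (a real setup would more likely use a thesaurus such as WordNet and a full stop-word list), so treat it as an illustration rather than a reference implementation.

```python
import random

# Hypothetical toy resources; assumptions made only for this illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "on", "and"}
SYNONYMS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "cheerful"],
    "movie": ["film"],
    "great": ["excellent", "superb"],
}


def _candidates(words):
    """Indices of words that are not stop words and have a known synonym."""
    return [i for i, w in enumerate(words) if w not in STOP_WORDS and w in SYNONYMS]


def synonym_replacement(words, n, rng=random):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    out = list(words)
    idxs = _candidates(out)
    for i in rng.sample(idxs, min(n, len(idxs))):
        out[i] = rng.choice(SYNONYMS[out[i]])
    return out


def random_insertion(words, n, rng=random):
    """n times: insert a synonym of a random non-stop word at a random position."""
    out = list(words)
    for _ in range(n):
        idxs = _candidates(out)
        if not idxs:
            break
        synonym = rng.choice(SYNONYMS[out[rng.choice(idxs)]])
        out.insert(rng.randint(0, len(out)), synonym)
    return out


def random_swap(words, n, rng=random):
    """n times: swap the positions of two randomly chosen words."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words, p, rng=random):
    """Drop each word independently with probability p."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    # Keep at least one word so the augmented example is never empty.
    return kept if kept else [rng.choice(words)]
```

Each function takes a tokenized sentence (a list of words) and returns a new list, so the original training example is left untouched and several augmented copies can be generated from it.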

But I found nothing about using a pre-trained word-vector model such as word2vec. Is there a reason?

Data augmentation using word2vec might help the model by bringing in external information. For instance, one could randomly replace a token in a toxic comment with its nearest neighbor in a vector space pre-trained specifically on external online comments.
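
A minimal sketch of the idea, to show what I have in mind. The tiny 3-dimensional vectors and the function names are hypothetical placeholders for a real pre-trained model; with an actual word2vec model the lookup would go through the library's nearest-neighbor query instead of this hand-rolled cosine search.

```python
import math
import random

# Hypothetical 3-d vectors standing in for a pre-trained embedding table.
EMBEDDINGS = {
    "stupid": [0.9, 0.1, 0.0],
    "dumb": [0.85, 0.15, 0.05],
    "idiot": [0.8, 0.3, 0.1],
    "nice": [0.0, 0.9, 0.2],
    "kind": [0.05, 0.85, 0.25],
    "comment": [0.1, 0.2, 0.9],
    "post": [0.15, 0.25, 0.85],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbor(word):
    """Closest other in-vocabulary word by cosine similarity."""
    vec = EMBEDDINGS[word]
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine(vec, EMBEDDINGS[w]))


def embedding_replacement(words, p, rng=random):
    """Replace each in-vocabulary word by its nearest neighbor with probability p."""
    return [
        nearest_neighbor(w) if w in EMBEDDINGS and rng.random() < p else w
        for w in words
    ]
```

With these toy vectors, embedding_replacement("you are stupid".split(), 1.0) swaps "stupid" for "dumb" and leaves the out-of-vocabulary words alone. The sketch assumes that the nearest neighbor keeps the label of the sentence, which is not guaranteed, since closeness in the vector space reflects similar contexts rather than strict synonymy.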

Is it a good method, or am I missing some important drawbacks of this technique?
