Reputation: 1161
I'm using the pre-trained word-embeddings model GloVe Twitter 200 and fine-tuning it on a Twitter dataset I have. The main goal is to convert each sentence into a vector (by averaging its individual word vectors), perform clustering over the entire dataset, and later use the trained model to generate embeddings for new tweets as they arrive.
So my question is whether there is any specific way to pre-process my dataset for this particular model before the training step. I have checked the vocabulary of the pre-trained model and noticed that mentions, hashtags, cashtags, URLs, etc. are all replaced by placeholder tokens:
import gensim.downloader as api
wv = api.load('glove-twitter-200')
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")
>>
word #0/1193514 is <user>
word #1/1193514 is .
word #2/1193514 is :
word #3/1193514 is rt
word #4/1193514 is ,
word #5/1193514 is <repeat>
word #6/1193514 is <hashtag>
word #7/1193514 is <number>
word #8/1193514 is <url>
word #9/1193514 is !
I've searched Google for "pre-process text for using glove-twitter-200", but nothing relevant comes up; most pre-processing guides simply remove these tokens, which I don't think is the right thing to do here.
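As for the sentence-to-vector step described above (averaging word vectors before clustering), here is a minimal sketch. It uses a toy dictionary in place of the real model so it runs without downloading anything; with gensim you would load `wv = api.load('glove-twitter-200')` and use `wv[token]` and `token in wv` the same way.

```python
import numpy as np

# Toy stand-in for the loaded KeyedVectors model (2-d vectors instead of 200-d).
toy_wv = {
    "<user>": np.array([1.0, 0.0]),
    "hello": np.array([0.0, 1.0]),
}

def sentence_vector(tokens, wv, dim=2):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Out-of-vocabulary tokens ("oov") are simply skipped before averaging.
print(sentence_vector(["<user>", "hello", "oov"], toy_wv))  # [0.5 0.5]
```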
Upvotes: 0
Views: 933
Reputation: 1161
I found the preprocessing function, written in Ruby, on the Stanford website. For anyone looking, this is the link: https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb
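For those who would rather stay in Python, here is a partial, unofficial port covering a few of the script's rules (URLs, mentions, hashtags, numbers, repeated punctuation); the original Ruby script also handles smileys, elongated words, and all-caps markers, which are omitted here. The regexes are my own approximations, not taken verbatim from the script.

```python
import re

def preprocess_tweet(text):
    """Partial Python approximation of Stanford's preprocess-twitter.rb:
    map raw tweet features to the placeholder tokens GloVe Twitter uses."""
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)   # URLs
    text = re.sub(r"@\w+", "<user>", text)                   # mentions
    text = re.sub(r"#(\S+)", r"<hashtag> \1", text)          # hashtags
    text = re.sub(r"[-+]?[.\d]*\d+[:,.\d]*", "<number>", text)  # numbers
    text = re.sub(r"([!?.]){2,}", r"\1 <repeat>", text)      # "!!!" -> "! <repeat>"
    return text.lower()

print(preprocess_tweet("@bob check https://t.co/x #nlp 2023!!!"))
# <user> check <url> <hashtag> nlp <number>! <repeat>
```

Tokenizing the result on whitespace then yields tokens that line up with the vocabulary entries shown in the question (`<user>`, `<url>`, `<hashtag>`, `<number>`, `<repeat>`).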
Upvotes: 0