Reputation: 1161
I'm using the pre-trained word-embeddings model GloVe Twitter 200 and fine-tuning it on a Twitter dataset I have. The main goal is to convert each sentence into a vector (by averaging its individual word vectors), perform clustering over the entire dataset, and later use the trained model to generate embeddings for new tweets as they arrive.
So my question is whether there is any specific way to pre-process my dataset for this particular model before the training step. I have checked the vocabulary of the pre-trained model and noticed that mentions, hashtags, cashtags, URLs, etc. are all replaced by placeholder tokens:
import gensim.downloader as api
wv = api.load('glove-twitter-200')
for index, word in enumerate(wv.index_to_key):
    if index == 10:
        break
    print(f"word #{index}/{len(wv.index_to_key)} is {word}")
>>
word #0/1193514 is <user>
word #1/1193514 is .
word #2/1193514 is :
word #3/1193514 is rt
word #4/1193514 is ,
word #5/1193514 is <repeat>
word #6/1193514 is <hashtag>
word #7/1193514 is <number>
word #8/1193514 is <url>
word #9/1193514 is !
I've searched Google for "pre-process text for using glove-twitter-200", but nothing relevant comes up; most pre-processing guides simply remove these tokens, which I don't think is the right thing to do here.
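As for the sentence-to-vector step described above (averaging word vectors before clustering), here is a minimal sketch. It uses a toy dictionary in place of the real model so it runs without downloading anything; with gensim you would load `wv = api.load('glove-twitter-200')` and use `wv[token]` and `token in wv` the same way.

```python
import numpy as np

# Toy stand-in for the loaded KeyedVectors model (2-d vectors instead of 200-d).
toy_wv = {
    "<user>": np.array([1.0, 0.0]),
    "hello": np.array([0.0, 1.0]),
}

def sentence_vector(tokens, wv, dim=2):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Out-of-vocabulary tokens ("oov") are simply skipped before averaging.
print(sentence_vector(["<user>", "hello", "oov"], toy_wv))  # [0.5 0.5]
```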
Upvotes: 0
Views: 933
Reputation: 1161
I found the preprocessing function, written in Ruby, on the Stanford website. For anyone looking, this is the link: https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb
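For those who would rather stay in Python, here is a partial, unofficial port covering a few of the script's rules (URLs, mentions, hashtags, numbers, repeated punctuation); the original Ruby script also handles smileys, elongated words, and all-caps markers, which are omitted here. The regexes are my own approximations, not taken verbatim from the script.

```python
import re

def preprocess_tweet(text):
    """Partial Python approximation of Stanford's preprocess-twitter.rb:
    map raw tweet features to the placeholder tokens GloVe Twitter uses."""
    text = re.sub(r"https?://\S+|www\.\S+", "<url>", text)   # URLs
    text = re.sub(r"@\w+", "<user>", text)                   # mentions
    text = re.sub(r"#(\S+)", r"<hashtag> \1", text)          # hashtags
    text = re.sub(r"[-+]?[.\d]*\d+[:,.\d]*", "<number>", text)  # numbers
    text = re.sub(r"([!?.]){2,}", r"\1 <repeat>", text)      # "!!!" -> "! <repeat>"
    return text.lower()

print(preprocess_tweet("@bob check https://t.co/x #nlp 2023!!!"))
# <user> check <url> <hashtag> nlp <number>! <repeat>
```

Tokenizing the result on whitespace then yields tokens that line up with the vocabulary entries shown in the question (`<user>`, `<url>`, `<hashtag>`, `<number>`, `<repeat>`).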
Upvotes: 0