Reputation: 85
I am scraping reviews from Amazon and want to run sentiment analysis to classify them as positive, negative, or neutral. The data I get will be unlabeled text.
My approach to this problem would be as follows:
1.) Label the data using a clustering algorithm such as DBSCAN, HDBSCAN, or KMeans. The number of clusters would obviously be 3.
2.) Train a Classification algorithm on the labelled data.
Now I have never performed clustering on text data but I am familiar with the basics of clustering. So my question is:
Is my approach correct?
Are there any articles/blogs/tutorials I can follow for text-based clustering, since I am kinda new to this?
Upvotes: 1
Views: 397
Reputation: 2056
I have never run this kind of experiment myself, but as far as I know, the most challenging part of this work is transforming the sentences or documents into fixed-length vectors (mapping them into a semantic space). I highly suggest using a sentiment analysis pipeline from the huggingface library to embed the sentences (this way you might exploit some supervision). Another option is the sentence-transformers library, which is straightforward and still good.
After you reach this point (every review ==> fixed-length vector), you can use whatever clustering algorithm you want and then look at the results.
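The flow described above (every review ==> fixed-length vector, then cluster) could be sketched roughly as follows. This is only a minimal illustration under some assumptions: it uses scikit-learn's TF-IDF as a stand-in embedder so it runs without downloading a model, the function names `embed_tfidf` and `cluster_reviews` are made up for this example, and the reviews are hypothetical toy data.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def embed_tfidf(reviews):
    """Stand-in embedder: one fixed-length TF-IDF vector per review."""
    return TfidfVectorizer(stop_words="english").fit_transform(reviews)


def cluster_reviews(reviews, embed=embed_tfidf, n_clusters=3, seed=0):
    """Embed each review, then assign it to one of n_clusters groups."""
    vectors = embed(reviews)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return model.fit_predict(vectors)


# Hypothetical toy reviews, only for illustration.
reviews = [
    "Great product, works perfectly and arrived early.",
    "Terrible quality, it broke after two days.",
    "It is okay, nothing special for the price.",
    "Absolutely love it, best purchase this year.",
    "Awful experience, would not buy again.",
    "Average item, does the job.",
]

labels = cluster_reviews(reviews)
for label, review in zip(labels, reviews):
    print(label, review)
```

To try sentence embeddings instead, you could pass a different `embed` callable, for example one that wraps `SentenceTransformer("all-MiniLM-L6-v2").encode` from the sentence-transformers library. Keep in mind that the cluster ids are arbitrary numbers, so you still have to read a few reviews from each cluster to see whether the groups line up with sentiment at all.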
Upvotes: 1