My aim is to use BERTopic for semi-supervised, guided topic modelling on a set of parliamentary speeches, already split into sentences, to determine which mode of energy production each sentence is talking about. I used a rudimentary TF-IDF + cosine_similarity combo to compute similarities between my sentences and my lists of topic-specific keywords, assigned the associated labels to the subset of sentences that crossed a threshold similarity score, and followed convention by labelling the ambiguous sentences -1.
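For reference, the labelling step described above works roughly like this (a minimal sketch with made-up sentences, keyword lists, and threshold, not the actual data):

```python
# Hypothetical sketch: TF-IDF vectors for sentences and per-topic keyword
# strings, cosine similarity, threshold, and -1 for ambiguous sentences.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["wind turbines generate clean power", "the budget debate continues"]
topic_keywords = {0: "wind solar renewable turbine", 1: "coal gas oil fossil"}

vectorizer = TfidfVectorizer()
sent_vecs = vectorizer.fit_transform(sentences)          # fit on the sentences
topic_vecs = vectorizer.transform(topic_keywords.values())

sims = cosine_similarity(sent_vecs, topic_vecs)          # shape (n_sentences, n_topics)
threshold = 0.1
labels = [int(np.argmax(row)) if row.max() >= threshold else -1 for row in sims]
# Sentences below the threshold for every topic get the conventional -1 label.
```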
In my most recent attempt I decided to create the sentence embeddings separately, to see if the error would go away; the topic model was producing the same error when left to use its default embedding model and parameters.
My docs list contains the sentences from my dataset in lower case (in some attempts I also removed punctuation). Am I missing a key dependency, or perhaps a crucial pre-processing step?
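The pre-processing mentioned above amounts to something like this (a hypothetical helper using only the standard library):

```python
import string

def preprocess(sentence: str) -> str:
    # Lower-case the sentence and strip all ASCII punctuation.
    return sentence.lower().translate(str.maketrans("", "", string.punctuation))

print(preprocess("Wind Power, not Coal!"))  # wind power not coal
```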
A snippet of the code I am trying to run:
from sentence_transformers import SentenceTransformer

docs = df['docs'].to_list()
assigned_labels = df['similarity_label'].to_list()

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True, batch_size=18)
The error stack I receive:
     18 embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
---> 19 embeddings = embedding_model.encode(docs, show_progress_bar=True, batch_size=18)
484 sentences_batch = sentences_sorted[start_index : start_index + batch_size]
--> 485 features = self.tokenize(sentences_batch)
--> 922 return self._first_module().tokenize(texts)
152 batch1, batch2 = [], []
153 for text_tuple in texts:
--> 154 batch1.append(text_tuple[0])
155 batch2.append(text_tuple[1])
156 to_tokenize = [batch1, batch2]
TypeError: 'float' object is not subscriptable
I don't understand where these floats are coming from, or how I can deal with them.
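One way to look for them: pandas represents missing values as the float nan, so any missing entry in df['docs'] would surface as a float inside the docs list passed to encode(). A minimal check along these lines (with a toy DataFrame standing in for the real one) would reveal whether that is the case here:

```python
# Hypothetical check: list every entry of docs that is not a string,
# along with its position, to see whether NaN floats are present.
import pandas as pd

df = pd.DataFrame({"docs": ["a sentence", float("nan"), "another sentence"]})
docs = df["docs"].to_list()

non_strings = [(i, x) for i, x in enumerate(docs) if not isinstance(x, str)]
print(non_strings)
```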