Bbrk24

Reputation: 951

Topic modelling many documents with low memory overhead

I've been working on a topic modelling project using BERTopic 0.16.3, and the preliminary results were promising. However, as the project progressed and the full requirements became apparent, I ran into a scalability issue.

Specifically:

That last requirement necessitates batching the documents, since loading them all into memory at once requires memory proportional to the corpus size. So, I've been looking into clustering algorithms that support online topic modelling. BERTopic's documentation suggests scikit-learn's MiniBatchKMeans, but the results I'm getting from that aren't very good.
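For concreteness, here is roughly the setup I'm using, following the online-learning pattern from BERTopic's documentation (the batch loader is a hypothetical stand-in for however the documents are streamed from disk):

from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic import BERTopic
from bertopic.vectorizers import OnlineCountVectorizer

# Incremental stand-ins for UMAP and HDBSCAN, per the BERTopic docs
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english")

topic_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=cluster_model,
    vectorizer_model=vectorizer_model,
)

# Hypothetical loader: yields lists of documents so only one batch
# is ever held in memory at a time.
def doc_batches(path, batch_size=10_000):
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line.rstrip("\n"))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

for batch in doc_batches("documents.txt"):
    topic_model.partial_fit(batch)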

Some models I've looked at include:

The latter two also don't provide the predict method, limiting their utility.
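To illustrate why that matters for my use case: after fitting, I want to assign topics to new batches of documents, which (as I understand BERTopic's API) looks roughly like this, where topic_model is a fitted model and new_docs is a fresh batch:

# transform() routes new documents through the cluster model's
# prediction machinery, so the clusterer needs to support it.
new_topics, new_probs = topic_model.transform(new_docs)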

I am fairly new to the subject, so perhaps I'm approaching this completely wrong and the immediate problem I'm trying to solve has no solution. To be clear, the underlying question I'm trying to answer is: how do I perform topic modelling (and get good results) on a large number of documents without using too much memory?

Upvotes: 4

Views: 261

Answers (1)

Nick Becker

Reputation: 4224

In general, advanced techniques like UMAP and HDBSCAN help produce high-quality results on larger datasets, but they take more memory. Unless it's absolutely required, you may want to consider relaxing the memory constraint for the sake of performance, real-world human time, and actual cost (hourly instance or otherwise).

At this scale, for a workflow you expect to go to production, it may be easier to switch hardware than to work around the constraint in software. The GPU-accelerated UMAP and HDBSCAN in cuML can handle this much data very quickly -- quickly enough that it's probably worth renting a GPU-enabled system if you don't have one locally.

For the following example, I took a sample of one million Amazon reviews, encoded them into embeddings (384 dimensions), and used the GPU UMAP and HDBSCAN in the cuML v25.02 release. I ran this on a system with an H100 GPU.

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import pandas as pd
from cuml.manifold.umap import UMAP
from cuml.cluster import HDBSCAN

# https://amazon-reviews-2023.github.io/
df = pd.read_json("Electronics.jsonl.gz", lines=True, nrows=1000000)
reviews = df.text.tolist()

# Create embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(reviews, batch_size=1024, show_progress_bar=True)

# Reduce the 384-dimensional embeddings to 5 dimensions with GPU UMAP
reducer = UMAP(n_components=5)
%time reduced_embeddings = reducer.fit_transform(embeddings)
CPU times: user 1min 31s, sys: 9.69 s, total: 1min 40s
Wall time: 7.8 s

# Cluster the reduced embeddings with GPU HDBSCAN
clusterer = HDBSCAN()
%time clusterer.fit(reduced_embeddings)
CPU times: user 27.7 s, sys: 1.72 s, total: 29.5 s
Wall time: 30.4 s

There's an example of how to run these steps on GPUs in the BERTopic FAQs.
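For reference, a minimal sketch of that wiring, reusing the reviews and embeddings from the snippet above (the HDBSCAN parameters here are illustrative, not tuned):

# Pass the cuML estimators straight into BERTopic, as described
# in the BERTopic FAQ.
topic_model = BERTopic(
    umap_model=UMAP(n_components=5),
    hdbscan_model=HDBSCAN(min_samples=10, prediction_data=True),
)
topics, probs = topic_model.fit_transform(reviews, embeddings)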

I work on these projects at NVIDIA and am a community contributor to BERTopic, so if you run into any issues, please let me know and file a GitHub issue.

Upvotes: 2
