Esther

Reputation: 11

Support bigrams in topic modeling using the Mallet Java API

We would like to build a topic model with bigrams. What is the recommended way to implement this in Java?

Currently, we use the Mallet Java API, specifically ParallelTopicModel, and pass the tokens as a string in the data parameter of each Instance object.

Thank you.

Upvotes: 1

Views: 314

Answers (1)

David Mimno

Reputation: 1911

The easiest and most reliable way to account for n-grams is to modify the input. For example, you might replace "new york" with "new_york", and then tokenize using a pattern that accepts "_" as a letter character. Mallet allows you to specify a file with strings to treat as single tokens when you import documents:

bin/mallet import-file --help
A tool for creating instance lists of feature vectors from comma-separated-values
...
--replacement-files FILE [FILE ...]
  files containing string replacements, one per line:
    'A B [tab] C' replaces A B with C,
    'A B' replaces A B with A_B
  Default is (null)
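If you build your Instance objects through the Java API rather than the command line, the same idea could be applied to each document string before it is passed in as data. The following is only a minimal sketch using the standard library: PhraseJoiner, joinPhrases, and tokenize are hypothetical names, not part of Mallet, and it assumes the text has already been lowercased.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a Mallet class): joins known phrases with "_"
// so that a tokenizer can treat each phrase as a single token.
class PhraseJoiner {

    // Token pattern that accepts "_" as a letter character.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}_]+");

    // Replaces each listed phrase, e.g. "new york", with "new_york".
    // Assumes text and phrases use the same case.
    static String joinPhrases(String text, List<String> phrases) {
        String result = text;
        for (String phrase : phrases) {
            String joined = phrase.trim().replaceAll("\\s+", "_");
            // \b stops "new york" from matching inside "renew yorkshire".
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase) + "\\b");
            result = p.matcher(result).replaceAll(Matcher.quoteReplacement(joined));
        }
        return result;
    }

    // Splits text into tokens using the underscore-aware pattern.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

The string returned by joinPhrases could then serve as the data of an Instance, provided the tokenizing pattern in your pipeline also accepts "_" the way TOKEN does here.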

This mode of use requires you to identify specific n-grams. You could also modify the input file to include all bigrams, so "to be or not to be" would become "to_be be_or or_not not_to to_be". I don't know whether that would produce anything useful.
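For the all-bigrams variant, the rewrite is simple enough to do in plain Java before the text reaches Mallet. This is a rough sketch, assuming whitespace-separated tokens; BigramTokens and toBigrams are hypothetical names, not Mallet API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Mallet class): rewrites a document as its
// sequence of adjacent word pairs joined by "_".
class BigramTokens {

    // Returns the bigrams of the text separated by single spaces.
    // A document with fewer than two words yields an empty string.
    static String toBigrams(String text) {
        String[] words = text.trim().split("\\s+");
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            bigrams.add(words[i] + "_" + words[i + 1]);
        }
        return String.join(" ", bigrams);
    }
}
```

Calling BigramTokens.toBigrams("to be or not to be") gives "to_be be_or or_not not_to to_be". As with specific phrases, the tokenizing pattern needs to accept "_" so each bigram stays a single token.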

There are also topic model variants that "natively" support n-gram identification, but at a significant cost in training time and model quality. I would not recommend using any of them.

Upvotes: 1
