Reputation: 11
We would like to build a topic model with bigrams. What is the recommended way to implement this in Java?
Currently, we use the Mallet Java API, specifically ParallelTopicModel, and pass the tokens as a string to the data parameter of an Instance object.
Thank you.
Upvotes: 1
Views: 314
Reputation: 1911
The easiest and most reliable way to account for n-grams is to modify the input. For example, you might replace "new york" with "new_york", and then tokenize using a pattern that accepts "_" as a letter character. Mallet allows you to specify a file with strings to treat as single tokens when you import documents:
bin/mallet import-file --help
A tool for creating instance lists of feature vectors from comma-separated-values
...
--replacement-files FILE [FILE ...]
files containing string replacements, one per line:
'A B [tab] C' replaces A B with C,
'A B' replaces A B with A_B
Default is (null)
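If you build Instance objects through the Java API rather than the command-line importer, the same replacement idea can be applied to the text before it reaches Mallet. The following is only a rough sketch using the standard library: the class name BigramReplacer, its methods, and the phrase list are hypothetical, and it assumes the input has already been lowercased. A pattern like the one shown could presumably also be handed to Mallet's CharSequence2TokenSequence pipe, but check that against the version you use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mallet.
final class BigramReplacer {
    // Tokens are runs of letters or underscores, so "new_york" stays one token.
    static final Pattern TOKEN = Pattern.compile("[\\p{L}_]+");

    // Replace each known phrase with its underscore-joined form.
    // Assumes text and phrases are already lowercased.
    static String joinPhrases(String text, List<String> phrases) {
        String result = text;
        for (String phrase : phrases) {
            String joined = phrase.trim().replaceAll("\\s+", "_");
            // Word boundaries keep "new york" from matching inside "renew yorkshire".
            Pattern p = Pattern.compile("\\b" + Pattern.quote(phrase.trim()) + "\\b");
            result = p.matcher(result).replaceAll(Matcher.quoteReplacement(joined));
        }
        return result;
    }

    // Split text into tokens using the underscore-aware pattern.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

The string returned by joinPhrases can then be passed as the data of an Instance, provided the tokenization step in your pipeline also treats the underscore as a letter.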
This mode of use requires you to identify specific n-grams. You could also modify the input file to include all bigrams, so "to be or not to be" would become "to_be be_or or_not not_to to_be". I don't know whether that would produce anything useful.
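For illustration, the all-bigrams transformation could look roughly like this in plain Java. The class and method names are made up for this sketch, and it assumes whitespace-separated tokens.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mallet.
final class AllBigrams {
    // Turn whitespace-separated tokens into a string of underscore-joined
    // adjacent pairs. Input with fewer than two tokens yields an empty string.
    static String toBigrams(String text) {
        String[] tokens = text.trim().split("\\s+");
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            bigrams.add(tokens[i] + "_" + tokens[i + 1]);
        }
        return String.join(" ", bigrams);
    }
}
```

Note that this replaces the unigrams entirely. If you want both, you would append the bigram string to the original text instead, which roughly doubles the number of tokens per document.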
There are also topic model variants that "natively" support n-gram identification, but at a significant cost in training time and model quality. I would not recommend using any of them.
Upvotes: 1