Furkan Gözükara

Reputation: 23870

How to obtain all possible words from given hunspell dictionary?

I would like to parse the Hunspell-format .aff and .dic files that OpenOffice supports.

English .aff and .dic files can be downloaded from here, for example: http://extensions.openoffice.org/en/project/english-dictionaries-apache-openoffice

I want to scan each line of the given .dic file and generate every possible word for that line using the rules in the provided .aff file.

How can I do that?

I have installed the NHunspell framework, but it does not have that feature: https://www.nuget.org/packages/NHunspell/

For example, for the English language, let's consider

make/UAGS

make can be make, made, makes, making, etc.

Now I need a parser that gives me all these combinations. How can I obtain them? Thank you very much.

So basically I want to scan each line of the dictionary and generate all possible words from the word on that line, and I don't know how to do that.

I could also write my own parser, but the rules seem pretty complex to me, and there is no detailed, easy-to-follow documentation about them.

Here is what I want, basically. The image explains it very clearly.

Given analyze/ADSG and the en.dic and en.aff files, I want to obtain all of the following words:

analyze, analyzes, analyzing, analyzed, reanalyze, reanalyzes, reanalyzing, reanalyzed


Upvotes: 8

Views: 3157

Answers (3)

SztupY

Reputation: 10546

CSpell's hunspell-reader package allows you to get a full word list from a dictionary:

From their website:

Converting Hunspell to word list

To convert a Hunspell dictionary to a word list, you will need both the .dic and .aff files. For example en_US comes with two files: en_US.dic and en_US.aff. This tool assumes they are both in the same directory.

Assuming these files are in the current directory, the following command will write the words to en_US.txt.

hunspell-reader words ./en_US.dic -o en_US.txt

Upvotes: 0

Maëlan

Reputation: 4212

As Kartal Tabak pointed out, what you are looking for are the command-line tools wordforms and unmunch, which are distributed with Hunspell. However, wordforms handles only one stem at a time, and unmunch is very buggy. See this answer for alternatives.

Furthermore, it seems that Hunspell does not expose this feature as a library function. If you want to use it programmatically (you mentioned C# and NHunspell), then you probably need to spawn these external programs and parse their output.
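The "spawn and parse" approach can be sketched as follows. This is only a minimal illustration in Python (the same idea carries over to C#'s Process class). It assumes that the wordforms executable is on the PATH and that it prints one word per line to standard output; the helper names parse_wordforms_output and wordforms are made up for this example.

```python
import subprocess


def parse_wordforms_output(text):
    """Split tool output into unique, non-empty words, keeping first-seen order.

    Assumption: the tool prints one generated word per line.
    """
    seen = {}
    for line in text.splitlines():
        word = line.strip()
        if word:
            seen.setdefault(word, None)
    return list(seen)


def wordforms(aff_path, dic_path, word, exe="wordforms"):
    """Run `wordforms <aff> <dic> <word>` and return the generated forms.

    `exe` is assumed to be on the PATH. text=True decodes the output with
    the locale encoding, which may not match the dictionary's encoding.
    """
    result = subprocess.run(
        [exe, aff_path, dic_path, word],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_wordforms_output(result.stdout)
```

A call would then look like wordforms("en_US.aff", "en_US.dic", "analyze"). Keeping the parsing separate from the process call makes it easy to adjust if the tool's output format turns out to differ.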

Upvotes: 0

Kartal Tabak

Reputation: 894

If you want the entire word list, you can run unmunch:

unmunch dictionary.dic dictionary.aff

Note that the current implementation of unmunch in Hunspell has hard limits on the number of words, the number of affixes, and the length of generated words. So unmunch may fail if the target language exceeds those limits.

If you want just the list of possible words that can be generated from an entry, you may use wordforms:

wordforms dictionary.aff dictionary.dic word
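To give a rough idea of what these tools do internally, here is a minimal sketch of affix expansion in Python. It is not Hunspell's implementation and covers only a small subset of the .aff format: single-character flags, plain PFX/SFX rules, and one level of prefix/suffix cross product. AFF_SAMPLE is a hand-written illustrative fragment, not a real dictionary file, and the function names are made up for this example.

```python
import re

# Hypothetical, hand-written fragment in .aff syntax (NOT a real dictionary file).
AFF_SAMPLE = """\
PFX A Y 1
PFX A 0 re .

SFX D Y 4
SFX D 0 d e
SFX D y ied [^aeiou]y
SFX D 0 ed [^ey]
SFX D 0 ed [aeiou]y

SFX G Y 2
SFX G e ing e
SFX G 0 ing [^e]

SFX S Y 4
SFX S y ies [^aeiou]y
SFX S 0 s [aeiou]y
SFX S 0 es [sxzh]
SFX S 0 s [^sxzhy]
"""


def parse_aff(text):
    """Return {flag: {"kind", "cross", "entries"}} for PFX/SFX lines only."""
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[0] not in ("PFX", "SFX"):
            continue
        kind, flag = parts[0], parts[1]
        rule = rules.setdefault(flag, {"kind": kind, "cross": False, "entries": []})
        if len(parts) == 4:
            # Header line: PFX/SFX flag cross_product(Y/N) rule_count
            rule["cross"] = parts[2] == "Y"
            continue
        strip = "" if parts[2] == "0" else parts[2]
        add = parts[3].split("/")[0]  # ignore continuation flags (simplification)
        add = "" if add == "0" else add
        rule["entries"].append((strip, add, parts[4]))
    return rules


def apply_rule(word, kind, strip, add, cond):
    """Apply one affix rule; return the new word, or None if it does not apply."""
    if kind == "SFX":
        # The condition must match the end of the word.
        if word.endswith(strip) and re.search("(?:" + cond + ")$", word):
            return word[: len(word) - len(strip)] + add
    else:
        # The condition must match the start of the word.
        if word.startswith(strip) and re.match(cond, word):
            return add + word[len(strip):]
    return None


def expand(entry, rules):
    """Expand one .dic entry such as 'analyze/ADSG' into a set of word forms."""
    stem, _, flags = entry.partition("/")
    forms = {stem}
    suffixed = []  # suffixed forms that may also take a prefix
    prefixes = []
    for flag in flags:  # assumes single-character flags
        rule = rules.get(flag)
        if rule is None:
            continue
        for strip, add, cond in rule["entries"]:
            if rule["kind"] == "SFX":
                form = apply_rule(stem, "SFX", strip, add, cond)
                if form is not None:
                    forms.add(form)
                    if rule["cross"]:
                        suffixed.append(form)
            else:
                prefixes.append((rule["cross"], strip, add, cond))
    for cross, strip, add, cond in prefixes:
        for word in [stem] + (suffixed if cross else []):
            form = apply_rule(word, "PFX", strip, add, cond)
            if form is not None:
                forms.add(form)
    return forms


print(sorted(expand("analyze/ADSG", parse_aff(AFF_SAMPLE))))
# ['analyze', 'analyzed', 'analyzes', 'analyzing',
#  'reanalyze', 'reanalyzed', 'reanalyzes', 'reanalyzing']
```

Real .aff files use many more directives (long or numeric flags, compounding, NEEDAFFIX, two-level suffixes, and so on), so a sketch like this will miss or over-generate forms for most languages. For anything beyond experimentation, the existing tools are the safer choice.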

Upvotes: 9
