Yannick Wurm

Reputation: 3708

How do I autoscan local documents to add words to a custom dictionary?

I'd like my dictionary to know more of the words I use - and don't want to manually add all possible words as I end up typing them (I'm a biologist/bioinformatician - there's lots of jargon and specific software and species names). Instead I want to:

  1. Take a directory of existing documents. These are PDFs or Word/LaTeX documents of scientific articles; I guess they could "easily" be converted to plain text.
  2. Pull out all words that are not in the "normal" dictionary.
  3. Add these to my local custom dictionary (on my Mac that's ~/Library/Spelling/LocalDictionary). It would make sense to add them to the LibreOffice/Word/ispell custom dictionaries as well.

1 and 3 are easy. How can I do 2? Thanks!

Upvotes: 0

Views: 111

Answers (2)

Chris

Reputation: 3466

As far as I understand, you want to leave out words that already exist in the system dictionary. You might first ask whether this is really necessary, though. I guess duplicates won't cause any problems and won't noticeably slow down spell checking, so in my opinion there is no real need for step 2.

I think you'll have a much harder time with step 1. Extracting plain text from a PDF may sound easy, but it certainly is not. You'll end up with plenty of unknown symbols. You need to rejoin words that were split at the end of a line, and you probably want to exclude equations, links, numbers, etc. before adding all of this to your dictionary.
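To give an idea of what that clean-up could look like, here is a minimal sketch, assuming the text has already been extracted to a string. The names `rejoin_hyphenated` and `extract_words` are made up for this example, and the patterns are rough heuristics rather than a complete cleaner. One known limitation: rejoining also merges genuine compounds that happen to break at a line end ("well-" + "known" becomes "wellknown").

```python
import re
import string

# Illustrative patterns (assumptions, tune them for your own documents).
HYPHEN_BREAK = re.compile(r'(\w)-\s*\n\s*(\w)')       # "bioinfor-\nmatics"
URL = re.compile(r'(?:https?://|www\.)\S+')           # links
TOKEN = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")    # letters, with inner - or '


def rejoin_hyphenated(text):
    """Join words that were split by a hyphen at the end of a line."""
    return HYPHEN_BREAK.sub(r'\1\2', text)


def extract_words(text, min_length=2):
    """Return the set of word-like tokens, skipping links, numbers and equations."""
    words = set()
    for raw in rejoin_hyphenated(text).split():
        if URL.match(raw):
            continue
        # drop surrounding punctuation such as "(", ")", "," or ";"
        token = raw.strip(string.punctuation)
        # keep only tokens made entirely of letters (plus inner hyphens/apostrophes)
        if len(token) >= min_length and TOKEN.fullmatch(token):
            words.add(token)
    return words


if __name__ == '__main__':
    sample = "The fire ant Solenopsis in-\nvicta (see http://example.org/x1) has 16 chromosomes."
    print(sorted(extract_words(sample)))
```

Tokens containing digits or operators are dropped entirely here, which also throws away names like "p53", so loosen `TOKEN` if you need those.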

But if you have some tool to get this done and can create a couple of .txt files really containing only the words/sentences you need, then I would go with something like the following python code to "solve" the merge for your local dictionary only. Of course you can also extend this to load the system dictionary (wherever that is?) and merge it the same way I show below.

Please note that I left out any error handling on purpose.

Save as import_to_dict.py, adjust the paths to your requirements and call with python import_to_dict.py

#!/usr/bin/env python

import os
import re

# 1 - load existing dictionaries from files (adjust paths here!)
# os.path.expanduser() is needed because open() does not expand '~'
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')

reg_exp = r'[\s,.|/]+' #add symbols here

with open(local_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    dictionary = set(re.split(reg_exp, f.read()))

with open(global_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    global_dictionary = set(re.split(reg_exp, f.read()))

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        with open(os.path.join(root, file), 'r') as txt_f:
            # read the file contents
            words = txt_f.read()
            # split into word-set (set guarantees no duplicates)
            word_set = set(re.split(reg_exp, words))
            # drop words that already exist in either dictionary
            missing_words = (word_set - dictionary) - global_dictionary
            # add missing words to dictionary
            dictionary |= missing_words

# re.split() yields an empty string when the text starts or ends with a separator
dictionary.discard('')

# 3 - write dictionary file (sorted, one word per line)
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(sorted(dictionary)))
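Step 2 on its own boils down to a set difference. Here is a small sketch that also ignores case, so that "The" at the start of a sentence is not treated as a new word. The function names `load_word_list` and `unknown_words` are made up for this example, and the reference list is an assumption: a plain one-word-per-line file such as /usr/share/dict/words would do, though that is not necessarily the list the spell checker itself uses.

```python
def load_word_list(path):
    """Read a one-word-per-line file into a set (path is whatever list you have)."""
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def unknown_words(candidates, known_words):
    """Return the candidates that are not in known_words, ignoring case."""
    known = {w.lower() for w in known_words}
    return sorted({w for w in candidates if w.lower() not in known})


if __name__ == '__main__':
    # Inline data so the example runs without any files.
    candidates = {'The', 'genome', 'Solenopsis', 'of'}
    known = {'the', 'of', 'genome'}
    print(unknown_words(candidates, known))
```

Lower-casing only for the comparison keeps the original spelling of species and software names in the result.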

Upvotes: 1

dbenson

Reputation: 326

Here is a basic Java program that will generate a text file containing all of the unique words in a directory of plain text files, one word per line.

You can just replace the input directory and output file path strings with correct values for your system and run it.

import java.io.*;
import java.util.*;

public class MakeDictionary {
    public static void main(String[] args) throws IOException {
        // A sorted set keeps each word once and yields alphabetical output
        Set<String> dictionary = new TreeSet<String>();

        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";

        // listFiles() returns null if inputDir is missing or not a directory
        File[] files = new File(inputDir).listFiles();
        if (files == null) {
            throw new IOException("Cannot list directory: " + inputDir);
        }

        for (File file : files) {
            if (file.isFile()) {
                BufferedReader in = null;
                try {
                    in = new BufferedReader(new FileReader(file.getCanonicalPath()));
                    String line;
                    while ((line = in.readLine()) != null) {
                        // split on any run of whitespace, not just single spaces
                        String[] words = line.trim().split("\\s+");
                        for (String word : words) {
                            // blank lines produce one empty token; skip it
                            if (!word.isEmpty()) {
                                dictionary.add(word);
                            }
                        }
                    }
                } finally {
                    if (in != null) {
                        in.close();
                    }
                }
            }
        }

        // write one word per line, closing the file even if a write fails
        BufferedWriter out = new BufferedWriter(new FileWriter(outputFile));
        try {
            for (String word : dictionary) {
                out.write(word);
                out.newLine();
            }
        } finally {
            out.close();
        }
    }
}

Upvotes: 0
