Reputation: 3708
I'd like my dictionary to know more of the words I use - and don't want to manually add all possible words as I end up typing them (I'm a biologist/bioinformatician - there's lots of jargon and specific software and species names). Instead I want to:
~/Library/Spelling/LocalDictionary. But it would make sense to add them to the LibreOffice/Word/ispell custom dictionaries as well. 1 and 3 are easy. How can I do 2? Thanks!
Upvotes: 0
Views: 111
Reputation: 3466
As far as I understand, you want to remove duplicates (words that already exist in the system dictionary). You might first ask yourself whether this is really necessary, though. I guess duplicates won't cause any problems and won't slow down spell-checking noticeably, so in my opinion there is no real reason for step 2.
I think you'll have a much harder time with step 1. Extracting plain text from a PDF may sound easy, but it certainly is not. You'll end up with plenty of unknown symbols. You need to rejoin words that were split at the end of a line, and you probably want to exclude equations, links, numbers, etc. before adding everything to your dictionary.
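To give an idea of what that cleanup could look like, here is a minimal Python sketch. The function name clean_extracted_text and the exact rules (which characters count as a word, dropping single letters) are my own assumptions, not something a particular PDF tool requires, and it will certainly not catch every kind of debris.

```python
import re

def clean_extracted_text(text):
    """Return a set of plausible words from raw PDF-extracted text."""
    # Rejoin words split across a line break: "bioinfor-\nmatics" -> "bioinformatics".
    # Caveat: this also glues real compounds ("well-\nknown" -> "wellknown").
    text = re.sub(r'(\w)-\s*\n\s*(\w)', r'\1\2', text)
    # Drop URLs and e-mail addresses before tokenizing
    text = re.sub(r'(https?://|www\.)\S+|\S+@\S+', ' ', text)
    # Keep runs of letters only, allowing inner hyphens or apostrophes
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", text)
    # Discard single letters such as the "E" left over from "E. coli"
    return {t for t in tokens if len(t) > 1}
```

Numbers and most equation fragments disappear because only letter runs are kept, but expect to review the result by hand before trusting it.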
But if you have a tool to get this done and can create a couple of .txt files containing only the words and sentences you need, then I would go with something like the following Python code to "solve" the merge into your local dictionary. It also loads a system-wide dictionary and skips any words already in it; I don't know where that file actually lives on your system, so adjust the path.
Please note that I left out any error handling on purpose.
Save as import_to_dict.py, adjust the paths to your requirements and call with python import_to_dict.py
#!/usr/bin/env python
import os, re

# 1 - load existing dictionaries from files (adjust paths here!)
# expanduser() is needed because open() does not expand '~'
local_dictionary_file = os.path.expanduser('~/Library/Spelling/LocalDictionary')
global_dictionary_file = '/Library/Spelling/GlobalDictionary'
txt_file_folder = os.path.expanduser('~/Documents/ConvertedPapers')
reg_exp = r'[\s,.|/]+'  # add symbols here

with open(local_dictionary_file, 'r') as f:
    # splitting with regular expressions shouldn't really be needed for the dictionary, but it should work
    dictionary = set(re.split(reg_exp, f.read()))
with open(global_dictionary_file, 'r') as f:
    global_dictionary = set(re.split(reg_exp, f.read()))

# 2 - walk over all sub-dirs in your folder
for root, dirs, files in os.walk(txt_file_folder):
    # open all files (this could easily be limited to only .txt files)
    for file in files:
        with open(os.path.join(root, file), 'r') as txt_f:
            # read the file contents
            words = txt_f.read()
        # split into word-set (set guarantees no duplicates)
        word_set = set(re.split(reg_exp, words))
        # remove any words that already exist in either dictionary
        missing_words = (word_set - dictionary) - global_dictionary
        # add missing words to dictionary
        dictionary |= missing_words

# 3 - write dictionary file
dictionary.discard('')  # re.split yields '' when text starts or ends with a separator
with open(local_dictionary_file, 'w') as f:
    f.write('\n'.join(dictionary))
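If you do want some of the error handling that was left out, the following is a rough sketch of two helpers that could stand in for the open() calls in the script. The names load_word_set and write_dictionary, the UTF-8 encoding and the .bak backup suffix are assumptions of mine; the one-word-per-line output simply mirrors what the script writes.

```python
import os
import re
import shutil

def load_word_set(path, pattern=r'[\s,.|/]+'):
    """Read a text file and return its words as a set (empty set if unreadable)."""
    try:
        # errors='replace' swaps undecodable bytes for U+FFFD instead of raising
        with open(os.path.expanduser(path), 'r', encoding='utf-8', errors='replace') as f:
            text = f.read()
    except OSError as exc:
        print('skipping {}: {}'.format(path, exc))
        return set()
    words = set(re.split(pattern, text))
    words.discard('')  # produced by leading/trailing separators
    return words

def write_dictionary(words, path):
    """Back up an existing dictionary file, then write the words one per line."""
    path = os.path.expanduser(path)
    if os.path.exists(path):
        # Keep the previous version next to the original as <name>.bak
        shutil.copy2(path, path + '.bak')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(sorted(words)) + '\n')
```

The backup matters because step 3 overwrites the dictionary in place, so one bad input file could otherwise fill it with junk that is tedious to remove.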
Upvotes: 1
Reputation: 326
Here is a basic Java program that generates a text file containing all of the unique words found in a directory of plain text files, one word per line.
Just replace the input directory and output file path strings with the correct values for your system and run it.
import java.io.*;
import java.util.*;

public class MakeDictionary {
    public static void main(String[] args) throws IOException {
        // A set keeps each word only once
        Set<String> dictionary = new HashSet<String>();
        String inputDir = "C:\\test";
        String outputFile = "C:\\out\\dictionary.txt";

        File[] files = new File(inputDir).listFiles();
        if (files == null) {
            // listFiles() returns null if the path is not a readable directory
            throw new IOException("Cannot list directory: " + inputDir);
        }
        for (File file : files) {
            if (file.isFile()) {
                BufferedReader in = new BufferedReader(new FileReader(file));
                try {
                    String line;
                    while ((line = in.readLine()) != null) {
                        // split on any run of whitespace, not just single spaces
                        for (String word : line.trim().split("\\s+")) {
                            if (!word.isEmpty()) {
                                dictionary.add(word);
                            }
                        }
                    }
                } finally {
                    in.close();
                }
            }
        }

        // write one word per line; close the file even if a write fails
        BufferedWriter out = new BufferedWriter(new FileWriter(outputFile));
        try {
            for (String word : dictionary) {
                out.write(word);
                out.newLine();
            }
        } finally {
            out.close();
        }
    }
}
Upvotes: 0