Reputation: 405
I recently extracted text data from a directory of pdf files. When reading pdfs, sometimes the text returned is a little messy.
For example, I can be looking at a string that says:
"T he administrati on is doing bad things, and not fulfilling what it prom ised"
I want the result to be:
"The administration is doing bad things, and not fulfilling what it promised"
I tested code (using pyenchant and wx) I found on stackoverflow here and it did not return what I wanted. My modifications were as follows:
a = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
chkr = enchant.checker.SpellChecker("en_US")
chkr.set_text(a)
for err in chkr:
sug = err.suggest()[0]
err.replace(sug)
c = chkr.get_text()#returns corrected text
print(c)
This code returns:
"T he administrate on is doing bad things, and not fulfilling what it prom side"
I'm using Python 3.5.x on a Windows 7 Enterprise, 64-bit. I would appreciate any suggestions!
Upvotes: 2
Views: 5044
Reputation: 2622
I have taken Generic Human’s answer, slightly modified it to solve your problem.
You need to copy these 125k words, sorted by frequency into a text file, name the file words-by-frequency.txt
.
from math import log
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
with open("words-by-frequency.txt") as f:
words = [line.strip() for line in f.readlines()]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
Running the function with the input:
messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())
The administration is doing bad things and not fulfilling what it promised
>>>
Edit: The code below doesn't require the text file and works just for your input i.e., "T he administrati on is doing bad things, and not fulfilling what it prom ised"
from math import log
# Build a cost dictionary, assuming Zipf's law and cost = -math.log(probability).
words = ["the", "administration", "is", "doing", "bad",
"things", "and", "not", "fulfilling", "what",
"it", "promised"]
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)
def infer_spaces(s):
"""Uses dynamic programming to infer the location of spaces in a string
without spaces."""
# Find the best match for the i first characters, assuming cost has
# been built for the i-1 first characters.
# Returns a pair (match_cost, match_length).
def best_match(i):
candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)
# Build the cost array.
cost = [0]
for i in range(1,len(s)+1):
c,k = best_match(i)
cost.append(c)
# Backtrack to recover the minimal-cost string.
out = []
i = len(s)
while i>0:
c,k = best_match(i)
assert c == cost[i]
out.append(s[i-k:i])
i -= k
return " ".join(reversed(out))
messy_txt = "T he administrati on is doing bad things, and not fulfilling what it prom ised"
print(infer_spaces(messy_txt.lower().replace(' ', '').replace(',', '')).capitalize())
The administration is doing bad things and not fulfilling what it promised
>>>
I have just tried the above edit at repl.it and it printed the output as shown.
Upvotes: 2
Reputation: 1674
It looks like the enchant library you're using just isn't that good. It doesn't look for spelling mistakes across words, but instead just looks at words individually. I guess this makes sense since the function itself is called 'SpellChecker'.
The only thing I can think of is to look for better autocorrect libraries. Maybe this one might help? https://github.com/phatpiglet/autocorrect
No guarantees though.
Upvotes: 1