steca

Reputation: 87

Remove whitespace between two lowercase letters

I'm trying to find a regex (or a different method) that removes whitespace in a string only if it occurs between two lowercase letters. I'm doing this because I'm cleaning noisy text from scans where whitespace was mistakenly inserted inside words.

For example, I'd like to turn the string noisy = "Hel lo, my na me is Mark." into clean = "Hello, my name is Mark."

I've tried to capture the group in a regex (see below), but I don't know how to then replace only the whitespace between two lowercase letters. I have the same issue with re.sub.

This is what I've tried, but it doesn't work because it removes all the whitespace from the string:

import re

noisy = "Hel lo my name is Mark"

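# Capture the whitespace character that sits between two lowercase letters
finder = re.compile(r"[a-z](\s)[a-z]")
whitesp = finder.search(noisy).group(1)
# str.replace removes every occurrence of that character, not just the match
clean = noisy.replace(whitesp, "")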

print(clean)

Any ideas are appreciated, thanks!

EDIT 1: My use case is for Swedish words and sentences that I have OCR'd from scanned documents.

Upvotes: 1

Views: 306

Answers (3)

megan

Reputation: 11

To correct an entire string, you could try symspellpy.

First, install it using pip:

python -m pip install -U symspellpy

Then import the required packages and load the dictionaries. The dictionary files shipped with symspellpy (English ones in this example) can be located with pkg_resources.

Pass your string to the lookup_compound function, which returns a list of spelling suggestions (SuggestItem objects). Words that need no change are still included in the result. max_edit_distance is the maximum edit distance used for lookups (per single word, not for the entire string). Setting transfer_casing to True preserves the casing of the input. To get the clean string, join the term attribute of each suggestion.

import pkg_resources
from symspellpy import SymSpell

sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)

dictionary_path = pkg_resources.resource_filename(
    "symspellpy",
    "frequency_dictionary_en_82_765.txt"
)

bigram_path = pkg_resources.resource_filename(
    "symspellpy",
    "frequency_bigramdictionary_en_243_342.txt"
)

sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
sym_spell.load_bigram_dictionary(bigram_path, term_index=0, count_index=2)

my_str = "Hel lo, my na me is Mark."

sugs = sym_spell.lookup_compound(
    my_str,
    max_edit_distance=2,
    transfer_casing=True
)

print(" ".join([sug.term for sug in sugs]))

Output:

Hello my name is Mark

Check out their documentation for other examples and use cases.

Upvotes: 1

JK Chai

Reputation: 145

I think you need a Python module that contains a word list (like an Oxford dictionary), so you can check whether joining the fragments on either side of a space produces a valid word.

For example, break the string into a list with string.split(), then loop over the list starting at index 1 with range(1, len(your_list)). At each step, join the previous and current items, your_list[index - 1] + your_list[index], into a single token and check whether that token is in the set of words you have collected. If it is, append the joined token to a temporary list; if not, append the previous word unchanged. Once the loop is done, join the temporary list back into a string.

You can try the Python spell checker pyenchant, the Python grammar checker language-check, or even use NLTK corpora to build your own checker.
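
The split-and-join idea described above could look roughly like the sketch below. It is only an illustration under some assumptions: KNOWN_WORDS is a tiny made-up vocabulary (a real word list, for example a Swedish one, would be needed in practice), the helper names normalise and merge_split_words are hypothetical, and the rule that a fragment is only merged when it is not already a known word on its own is an extra heuristic that the description above does not mention.

```python
import string

# Hypothetical toy vocabulary; in practice load a real word list.
KNOWN_WORDS = {"hello", "my", "name", "is", "mark"}


def normalise(token):
    """Lowercase a token and strip surrounding punctuation for lookup."""
    return token.strip(string.punctuation).lower()


def merge_split_words(text, vocabulary):
    """Greedily join adjacent fragments whose concatenation is a known word."""
    merged = []
    for token in text.split():
        if merged:
            candidate = merged[-1] + token
            # Assumed heuristic: join only when the combined token is known
            # and the previous fragment is not a known word by itself.
            if (normalise(candidate) in vocabulary
                    and normalise(merged[-1]) not in vocabulary):
                merged[-1] = candidate
                continue
        merged.append(token)
    return " ".join(merged)


print(merge_split_words("Hel lo, my na me is Mark.", KNOWN_WORDS))
# Hello, my name is Mark.
```

Because the merge is greedy and works left to right, the quality of the result depends almost entirely on how complete the word list is.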

Upvotes: 0

Waket Zheng

Reputation: 6351

Is this what you want?

In [3]: finder = re.compile(r"([a-z])\s([a-z])")

In [4]: clean = finder.sub(r'\1\2', noisy, count=1)  # count=1: replace only the first match

In [5]: clean
Out[5]: 'Hello my name is Mark'
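
Note that the count of 1 in the call above limits the substitution to the first match, which is why the other spaces survive. A small sketch of the difference, using a lookaround variant of the pattern (my own rewrite, not part of the session above) so that only the whitespace itself is matched:

```python
import re

noisy = "Hel lo my name is Mark"

# Lookarounds match only the whitespace, so no capture groups are needed.
between_lowercase = re.compile(r"(?<=[a-z])\s+(?=[a-z])")

first_only = between_lowercase.sub("", noisy, count=1)
every_match = between_lowercase.sub("", noisy)

print(first_only)   # Hello my name is Mark
print(every_match)  # Hellomynameis Mark
```

Without the count, every space that separates two lowercase letters is removed, including the legitimate ones between words, so a regex alone probably cannot tell a split word from two real words.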

Upvotes: 0
