Python regex for multiple and single dots

Question

I'm currently trying to clean a 1-gram file. Some of the words are as follows:

1. word - basic word, classical case
2. word. - basic word but with a dot
3. w.s.f.w. - (word stands for word) - correct acronym
4. w.s.f.w - incorrect acronym (missing the last dot)

My current implementation uses two different RegExes because I haven't succeeded in combining them into one. The first RegEx recognises basic words:

find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)

The second one is used in order to recognise acronyms:

find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)

Let's say that I have an input_word as a sequence of characters. The output is obtained with:

"".join(re.findall(pattern, input_word))

Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1 where both patterns return the same length.

Case no. 2 is problematic because my approach produces word. (with the dot) but I need it to return word (without the dot). Currently the case is decided in favour of find_acronym_pattern, which produces the longer sequence.

Case no. 3 works as expected.

Case no. 4: find_acronym_pattern misses the last character, meaning that it produces w.s.f. whereas find_word_pattern produces wsfw.
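
For reference, the selection strategy described above can be sketched as follows. The helper name clean is hypothetical (it does not appear in the original code), and picking the longer result with max is an assumption about how the "longer is better" rule is applied:

```python
import re

find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)


def clean(input_word):
    """Hypothetical helper: run both patterns and keep the longer output."""
    candidates = [
        "".join(pattern.findall(input_word))
        for pattern in (find_word_pattern, find_acronym_pattern)
    ]
    # On a tie, max() keeps the first candidate (the plain-word result).
    return max(candidates, key=len)


for token in ("word", "word.", "w.s.f.w.", "w.s.f.w"):
    print(token, "->", clean(token))
```

With this sketch, word. keeps its dot and w.s.f.w is cut down to w.s.f., which are the two problem cases.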

I'm looking for a RegEx (preferably one instead of the two currently used) that:

given word returns word
given word. returns word
given w.s.f.w. returns w.s.f.w.
given w.s.f.w returns w.s.f.w.
given m.in returns m.in.

41686d6564 · Accepted Answer

If you want one regex, you can use something like this:

((?:[A-Za-z](\.))*[A-Za-z]+)\.?

And substitute with:

\1\2
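
To see why this substitution gives the desired result, it may help to look at what the two groups capture. The sketch below is only an illustration; the helper name groups_for is hypothetical. Group 1 holds the token without a trailing dot, and group 2 holds a dot only if the token contains a letter-dot pair, that is, if it looks like an acronym:

```python
import re

pattern = re.compile(r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?")


def groups_for(token):
    """Hypothetical helper: return (group 1, group 2) for a whole token."""
    match = pattern.fullmatch(token)
    return match.group(1), match.group(2)


for token in ("word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"):
    print(token, "->", groups_for(token))
```

For plain words group 2 does not participate in the match, so \1\2 reduces to the bare word; for acronyms group 2 is a dot, so the replacement always ends with exactly one dot.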

Python 3 example:

import re

regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n"
            "word.\n"
            "w.s.f.w.\n"
            "w.s.f.w\n"
            "m.in")
# Raw string, so that \1 and \2 are group references and not control characters.
subst = r"\1\2"

result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)

Output:

word
word
w.s.f.w.
w.s.f.w.
m.in.
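
Since the question cleans one token at a time, the same substitution could also be wrapped in a small per-token function. This is a minimal sketch, not part of the original answer: the name normalize_token is hypothetical, and it relies on Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string.

```python
import re

TOKEN_PATTERN = re.compile(r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?")


def normalize_token(token):
    """Hypothetical helper: strip the dot from plain words, complete acronyms."""
    # \2 is empty for plain words and "." for acronyms (Python 3.5+).
    return TOKEN_PATTERN.sub(r"\1\2", token)


for token in ("word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"):
    print(token, "->", normalize_token(token))
```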
