balkon16

Reputation: 1448

Python regex for multiple and single dots

I'm currently trying to clean a 1-gram file. Some of the words are as follows:

  1. word - basic word, classical case
  2. word. - basic word but with a dot
  3. w.s.f.w. - (word stands for word) - correct acronym
  4. w.s.f.w - incorrect acronym (missing the last dot)

My current implementation considers two different RegExes because I haven't succeeded in combining them in one. The first RegEx recognises basic words:

find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)

The second one is used in order to recognise acronyms:

find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)

Let's say that I have an input_word as a sequence of characters. The output is obtained with:

"".join(re.findall(pattern, input_word))

Then I choose which output to use based on length: the longer the output, the better. My strategy works well for case no. 1, where find_word_pattern returns the whole word (find_acronym_pattern finds nothing there because the word contains no dot).

Case no. 2 is problematic: my approach produces word. (with the dot), but I need it to return word (without the dot). Currently the case is decided in favour of find_acronym_pattern, which produces the longer sequence.

Case no. 3 works as expected.

Case no. 4: find_acronym_pattern misses the last character, so it produces w.s.f., whereas find_word_pattern produces wsfw.
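For reference, the selection strategy described above can be sketched as follows. The helper name longest_output is hypothetical (it does not appear in my code base under that name); the two patterns are the ones quoted earlier.

```python
import re

find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)


def longest_output(input_word):
    # Hypothetical helper: join the matches of each pattern and keep
    # the longer of the two joined strings.
    candidates = [
        "".join(re.findall(pattern, input_word))
        for pattern in (find_word_pattern, find_acronym_pattern)
    ]
    return max(candidates, key=len)


for word in ("word", "word.", "w.s.f.w.", "w.s.f.w"):
    print(word, "->", longest_output(word))
# word -> word
# word. -> word.
# w.s.f.w. -> w.s.f.w.
# w.s.f.w -> w.s.f.
```

The second and fourth lines of the output are the two problematic cases.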

I'm looking for a RegEx (preferably one instead of the two I currently use) that:

  1. given word returns word

  2. given word. returns word

  3. given w.s.f.w. returns w.s.f.w.

  4. given w.s.f.w returns w.s.f.w.

  5. given m.in returns m.in.

Upvotes: 0

Views: 2072

Answers (2)

41686d6564

Reputation: 19661

If you want one regex, you can use something like this:

((?:[A-Za-z](\.))*[A-Za-z]+)\.?

And substitute with:

\1\2

Regex demo.

Python 3 example:

import re

regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"

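# Pass count and flags by keyword; passing them positionally is deprecated since Python 3.13.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)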

Output:

word
word
w.s.f.w.
w.s.f.w.
m.in.

Python demo.
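To make the pattern easier to read, here is a sketch of the same regex written with re.VERBOSE and wrapped in a function that cleans one token at a time. The names ACRONYM_OR_WORD and normalize are only illustrative. It assumes Python 3.5 or later, where a group that did not take part in the match (\2 for plain words) is replaced with an empty string instead of raising an error.

```python
import re

# Same pattern as above, spelled out so each part can carry a comment.
ACRONYM_OR_WORD = re.compile(r"""
    (                     # group 1: the text to keep
      (?:[A-Za-z](\.))*   #   zero or more letter-plus-dot pairs; group 2 remembers a dot
      [A-Za-z]+           #   the final run of letters
    )
    \.?                   # optional trailing dot, consumed but left out of group 1
""", re.VERBOSE)


def normalize(token):
    # Illustrative helper: keep group 1 and append the remembered dot, if any.
    # For plain words group 2 never matched, so \2 expands to "" (Python 3.5+).
    return ACRONYM_OR_WORD.sub(r"\1\2", token)


for token in ("word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"):
    print(token, "->", normalize(token))
```

The trailing dot is always consumed by \.? and only comes back through \2, which is why word. loses its dot while w.s.f.w gains one.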

Upvotes: 1

alexis

Reputation: 50220

A regular expression match will never return what is not there, so matching alone cannot satisfy requirements 4 and 5, which add a dot. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:

found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
    found += "."

As you can see, I match a word plus any number of ".part" suffixes. Like your version, this matches not only single-letter acronyms but also longer abbreviations like Ph.D., Prof.Dr., or whatever.
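Here is a sketch of the same idea as a reusable function. The name clean_token is made up for illustration, and returning None for input with no word characters is an assumption about what you want in that case (the findall(...)[0] version would raise IndexError instead). Note that \w also matches digits and underscores, not only letters.

```python
import re

TOKEN = re.compile(r"\w+(?:\.\w+)*")


def clean_token(input_word):
    # Illustrative helper: the first match is the word without its final period.
    match = TOKEN.search(input_word)
    if match is None:
        return None  # assumption: nothing word-like in the input
    found = match.group(0)
    # Embedded periods mean an acronym or abbreviation: restore the final one.
    if "." in found:
        found += "."
    return found


for word in ("word", "word.", "w.s.f.w.", "w.s.f.w", "m.in", "Ph.D."):
    print(word, "->", clean_token(word))
# word -> word
# word. -> word
# w.s.f.w. -> w.s.f.w.
# w.s.f.w -> w.s.f.w.
# m.in -> m.in.
# Ph.D. -> Ph.D.
```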

Upvotes: 2
