Reputation: 1448
I'm currently trying to clean a 1-gram file. Some of the words are as follows:
1. word - basic word, classical case
2. word. - basic word but with a dot
3. w.s.f.w. - (word stands for word) - correct acronym
4. w.s.f.w - incorrect acronym (missing the last dot)
My current implementation considers two different RegExes because I haven't succeeded in combining them in one. The first RegEx recognises basic words:
find_word_pattern = re.compile(r'[A-Za-z]', flags=re.UNICODE)
The second one is used in order to recognise acronyms:
find_acronym_pattern = re.compile(r'([A-Za-z]+(?:\.))', flags=re.UNICODE)
Let's say that I have an input_word as a sequence of characters. The output is obtained with:
"".join(re.findall(pattern, input_word))
Then I choose which output to use based on the length: the longer the output the better. My strategy works well with case no. 1 where both patterns return the same length.
Case no. 2 is problematic because my approach produces "word." (with dot) but I need it to return "word" (without dot). Currently the case is decided in favour of find_acronym_pattern, which produces the longer sequence.
Case no. 3 works as expected.
Case no. 4: find_acronym_pattern misses the last character, meaning that it produces "w.s.f." whereas find_word_pattern produces "wsfw".
I'm looking for a RegEx (preferably one instead of two that are currently used) that:
1. given "word" returns "word"
2. given "word." returns "word"
3. given "w.s.f.w." returns "w.s.f.w."
4. given "w.s.f.w" returns "w.s.f.w."
5. given "m.in" returns "m.in."
Upvotes: 0
Views: 2072
Reputation: 19661
If you want one regex, you can use something like this:
((?:[A-Za-z](\.))*[A-Za-z]+)\.?
And substitute with:
\1\2
Python 3 example:
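import re
regex = r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?"
test_str = ("word\n" "word.\n" "w.s.f.w.\n" "w.s.f.w\n" "m.in")
subst = "\\1\\2"
# count and flags are passed by keyword; an unmatched group 2 is
# replaced with an empty string on Python 3.5+.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)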
Output:
word
word
w.s.f.w.
w.s.f.w.
m.in.
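To apply the same substitution to one token at a time rather than to a multi-line string, a minimal sketch could look like the following. The names TOKEN_RE and normalize_token are hypothetical, chosen only for illustration, and the sketch assumes Python 3.5 or later, where an unmatched group in the replacement becomes an empty string.

```python
import re

# Same pattern as in the answer: group 1 is the word or acronym without a
# trailing dot; group 2 holds "." only if a "letter + dot" pair was matched.
TOKEN_RE = re.compile(r"((?:[A-Za-z](\.))*[A-Za-z]+)\.?")


def normalize_token(token):
    """Hypothetical helper: apply the \\1\\2 substitution to a single token."""
    # On Python 3.5+ an unmatched group 2 is replaced with an empty string.
    return TOKEN_RE.sub(r"\1\2", token)


if __name__ == "__main__":
    for token in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in"]:
        print(token, "->", normalize_token(token))
```

Note that the pattern only treats single-letter segments as acronym parts, so an abbreviation with a multi-letter first segment, such as Ph.D., loses its inner dot instead of being preserved.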
Upvotes: 1
Reputation: 50220
A regular expression will never return what is not there, so you can forget about requirement 5. What you can do is always drop the final period, and add it back if the result contains embedded periods. That will give you the result you want, and it's pretty straightforward:
found = re.findall(r"\w+(?:\.\w+)*", input_word)[0]
if "." in found:
    found += "."
As you see, I match a word plus any number of ".part" suffixes. Like your version, this matches not only single-letter acronyms but also longer abbreviations like Ph.D., Prof.Dr., or whatever.
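Wrapped as a function, the same idea might be sketched as below. The names PART_RE and clean_word are hypothetical, and returning an empty string when nothing matches is an assumption made here to avoid the IndexError that indexing an empty findall result would raise.

```python
import re

# A word followed by any number of ".part" suffixes, as in the snippet above.
PART_RE = re.compile(r"\w+(?:\.\w+)*")


def clean_word(input_word):
    """Hypothetical helper: drop the final period, re-add it for acronyms."""
    match = PART_RE.search(input_word)
    if match is None:
        # Assumption: input without any word characters yields an empty string.
        return ""
    found = match.group(0)
    # Embedded periods mark an acronym or abbreviation, so restore the last dot.
    if "." in found:
        found += "."
    return found


if __name__ == "__main__":
    for word in ["word", "word.", "w.s.f.w.", "w.s.f.w", "m.in", "Ph.D."]:
        print(word, "->", clean_word(word))
```

Keep in mind that \w also matches digits and underscores, so a token such as 3.14 would be treated like an abbreviation; swapping \w for [A-Za-z] restricts the match to letters if that matters for the 1-gram file.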
Upvotes: 2