Reputation: 2605
I would like to write a regex to remove the last character from a string, if the character is a (s).
However in doing so I would like to retain the (s) if it is preceded by another (s).
Example:
The output of Apples should be Apple.
The output of Process should be Process.
I need a regex that would capture the whole term if the expression is matched but would perform the replacement for a partial match.
I have used s$ to get rid of the last character.
Upvotes: 1
Views: 121
Reputation: 6808
This has been talked about WAY too many times, and the consensus is always: it's WAY too complicated to be handled through a simple regex. All of the simple regex solutions fail on at least one of these examples:
apples
carrots
process
processes
tennis
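To make the failure concrete, here is a minimal sketch that runs one such simple regex over the list. It assumes the regex in question is the lookbehind pattern (?<!s)s$ suggested elsewhere in this thread; the helper name strip_final_s is made up for this illustration.

```python
import re

# Assumed pattern: a final "s" that is not preceded by another "s".
PATTERN = re.compile(r"(?<!s)s$")


def strip_final_s(word):
    """Remove a trailing 's' unless it follows another 's' (hypothetical helper)."""
    return PATTERN.sub("", word)


words = ["apples", "carrots", "process", "processes", "tennis"]
results = {word: strip_final_s(word) for word in words}

for word, stripped in results.items():
    print(f"{word} -> {stripped}")
# apples -> apple
# carrots -> carrot
# process -> process
# processes -> processe
# tennis -> tenni
```

The first three come out fine, but "processes" and "tennis" are mangled, which is why a lemmatizer is the safer route.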
A solution is to use morpha:
git clone https://github.com/knowitall/morpha
cd morpha/
flex -i -Cfea -8 -omorpha.yy.c morpha.lex
gcc -o morpha morpha.yy.c
curl -s https://raw.githubusercontent.com/jhlau/predom_sense/master/lemmatiser_tools/morpha/verbstem.list > verbstem.list
Now to test, with the examples above in test.txt:
cat test.txt | ./morpha -c
apple
carrot
process
process
tennis
If you want a Python solution, I suggest you go with nltk.
virtualenv env-nltk
source env-nltk/bin/activate
pip install nltk
python -c "import nltk; nltk.download()" # <- just get the whole thing, click "all" and then "download" on the "collections" tab
Now that everything is downloaded, let's fire up python and play with it.
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('apples')
u'apple'
>>> lmtzr.lemmatize('tennis')
'tennis'
>>> lmtzr.lemmatize('process')
'process'
>>> lmtzr.lemmatize('processes')
u'process'
Upvotes: 4
Reputation: 8769
You could use a negative lookbehind assertion to ensure the substitution happens only if s is not preceded by another s.
>>> import re
>>> re.sub(r'(?<!s)s$', '', 'Apples')
'Apple'
>>> re.sub(r'(?<!s)s$', '', 'Process')
'Process'
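If the input is a whole sentence rather than a single word, $ only matches at the very end of the string. A possible variation, sketched here under the assumption that words are separated by non-word characters, is to swap $ for a word boundary \b. The function name strip_plural_s is hypothetical.

```python
import re

# Assumption: \b (word boundary) marks the end of each word in the text.
WORD_FINAL_S = re.compile(r"(?<!s)s\b")


def strip_plural_s(text):
    """Drop a word-final 's' from every word, unless it follows another 's'."""
    return WORD_FINAL_S.sub("", text)


print(strip_plural_s("Apples Carrots Process"))  # Apple Carrot Process
```

Note that this is still purely mechanical: short words such as "is" or "this" would lose their final s as well.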
Upvotes: 0
Reputation: 785128
You can use this negative lookbehind assertion:
(?<!s)s$
Breakdown:
(?<!s) # assert previous position doesn't have 's'
s # match 's'
$ # assert end of line
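In Python, the same breakdown can be kept next to the pattern by compiling it with re.VERBOSE. This is only a sketch of how the pattern above might be used; the names PLURAL_S and drop_trailing_s are made up for the example.

```python
import re

# Same pattern as above; re.VERBOSE lets the comments live inside the regex.
PLURAL_S = re.compile(
    r"""
    (?<!s)   # assert the previous character is not 's'
    s        # match a literal 's'
    $        # assert end of the string
    """,
    re.VERBOSE,
)


def drop_trailing_s(word):
    """Remove a final 's' unless it is preceded by another 's'."""
    return PLURAL_S.sub("", word)


print(drop_trailing_s("Apples"))   # Apple
print(drop_trailing_s("Process"))  # Process
```

The match is case-sensitive as written, so an input like "APPLES" is left untouched unless re.IGNORECASE is added.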
Upvotes: 2