Reputation: 25
I have a huge text file which contains categories like this:
mango
mangoes
orange
oranges
cat
cats
I want to remove the plural words so that only these remain:
mango
orange
cat
Upvotes: 0
Views: 1916
Reputation: 55750
Regular expressions are a poor fit for this problem (the question was tagged Regex at the time of writing). Regular expressions are good for matching patterns and regular languages. English is not a regular language (that is, English is not a formal language that can be expressed using regular expressions), just as HTML and XML are not regular languages. The plural form in English is a good way to demonstrate the problem: the plural of "car" is "cars", but the plural of "bus" is not "buss" but "buses". And just as the question shows, the plural of "mango" is not the regular form "mangos" but "mangoes". What's worse, not all nouns that end in "o" form the plural by adding "es": the plural of "piano" is "pianos", not "pianoes". And what about "wolf" and "wife" going to "wolves" and "wives", and "child" going to "children"?
So I hope you're convinced: you're bound to run into trouble.
You'll have to write up a list of exceptions to the regular plural form, which adds an "s" after the singular form.
What you need is to implement a basic stemmer (one that is only concerned with the plural form). For further reading see: http://tartarus.org/martin/PorterStemmer/
Once you stem the words, you can use a hash set to check for duplicates efficiently: make a single pass over the words, stem each one, and add the stem to the set if it is not already there. If the stem is already in the set, remove the word, since it is a duplicate. The only problem is that this does not guarantee that the word you remove is the plural form: if the plural appears before the singular, the singular is the one that gets dropped. Without an English dictionary the problem is not easy.
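The single pass described above could look roughly like the following Python sketch. The names plural_stem, drop_duplicate_stems and the IRREGULAR table are made up for illustration, and the suffix rules are a crude heuristic, not the Porter stemmer; they will misfire on plenty of words (for example "horses" stems to "hors", which does not match "horse").

```python
# Hypothetical, deliberately tiny exception list; a real one would be far longer.
IRREGULAR = {"children": "child", "wolves": "wolf", "wives": "wife"}


def plural_stem(word):
    """Return a rough singular form of `word` (illustrative heuristic only)."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    # Plurals formed with "es": buses, boxes, churches, dishes, mangoes, ...
    if w.endswith(("ses", "xes", "zes", "ches", "shes", "oes")):
        return w[:-2]
    # Regular plural: strip a trailing "s", but leave words like
    # "glass", "bus" and "basis" alone.
    if w.endswith("s") and not w.endswith(("ss", "us", "is")):
        return w[:-1]
    return w


def drop_duplicate_stems(words):
    """Keep only the first word seen for each stem, preserving input order."""
    seen = set()
    kept = []
    for word in words:
        stem = plural_stem(word)
        if stem not in seen:
            seen.add(stem)
            kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["mango", "mangoes", "orange", "oranges", "cat", "cats"]
    print(drop_duplicate_stems(sample))
```

Because the first word per stem wins, an input where "cats" comes before "cat" keeps "cats", which is exactly the ordering caveat mentioned above.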
If you want really good accuracy you'll need to use a dictionary of English words that maps singular to plural.
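As a rough sketch of the dictionary approach, assuming you already have a singular-to-plural mapping loaded from some word list: the SINGULAR_TO_PLURAL excerpt and the function name remove_known_plurals below are hypothetical, and the sketch assumes a plural should only be dropped when its singular also appears in the file.

```python
# Hypothetical excerpt; in practice this would be loaded from a real
# English word list that maps each singular noun to its plural.
SINGULAR_TO_PLURAL = {
    "mango": "mangoes",
    "orange": "oranges",
    "cat": "cats",
    "bus": "buses",
    "child": "children",
}


def remove_known_plurals(words, singular_to_plural):
    """Drop every word that is the known plural of another word in `words`."""
    present = set(words)
    # Assumption: a plural is only removed when its singular is also present.
    plurals_to_drop = {
        plural
        for singular, plural in singular_to_plural.items()
        if singular in present
    }
    return [w for w in words if w not in plurals_to_drop]


if __name__ == "__main__":
    sample = ["mango", "mangoes", "orange", "oranges", "cat", "cats"]
    print(remove_known_plurals(sample, SINGULAR_TO_PLURAL))
```

Unlike the stem-and-set pass, this does not depend on the order of the lines, but it can only handle words that the dictionary actually contains.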
Upvotes: 7
Reputation: 12782
If you just want to filter out lines ending with s:
grep -P '[^s]$' file.txt > newfile.txt
Upvotes: -1