Reputation: 25

Remove plural words from a text file

I have a huge text file which contains categories like this:

mango    
mangoes   
orange   
oranges   
cat   
cats

I want to remove the plural words from the file, so that only this remains:

mango   
orange   
cat

Upvotes: 0

Answers (2)

Mike Dinescu

Reputation: 55750

This problem is not a good fit for regular expressions (the question was tagged Regex at the time of writing). Regular expressions are good at matching patterns and regular languages. English is not a regular language (that is, it is not a formal language that can be expressed with regular expressions), just as HTML and XML are not regular languages.

The English plural is a good way to demonstrate the problem: the plural of car is cars, but the plural of bus is not buss but buses. As the question itself shows, the plural used for mango is mangoes rather than the regular form mangos. Worse, not all nouns ending in o form the plural by adding es: the plural of piano is pianos, not pianoes. And what about wolf and wife becoming wolves and wives, and child becoming children?

So I hope you're convinced: with a regex alone you're bound to run into trouble.

You'll have to write up a list of exceptions to the regular plural rule, which simply appends an s to the singular form.

What you need is to implement a basic stemmer (one that is only concerned with the plural form). For further reading see: http://tartarus.org/martin/PorterStemmer/
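As a rough illustration of such a plural-only stemmer, here is a minimal Python sketch. It is not the Porter algorithm; the `IRREGULAR` table, the `singularize` name and the suffix rules are all assumptions made for this example, and they are deliberately incomplete.

```python
# Irregular plurals that no suffix rule covers (illustrative, far from complete).
IRREGULAR = {
    "children": "child",
    "wolves": "wolf",
    "wives": "wife",
}


def singularize(word: str) -> str:
    """Return a guessed singular form of an English noun, in lower case."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    # buses -> bus, boxes -> box, dishes -> dish, mangoes -> mango.
    # Known misfire: "horses" becomes "hors", because the rule cannot tell
    # "bus + es" apart from "horse + s".
    if lower.endswith(("ses", "xes", "zes", "ches", "shes", "oes")):
        return lower[:-2]
    # cities -> city
    if lower.endswith("ies") and len(lower) > 3:
        return lower[:-3] + "y"
    # cars -> car, but leave words such as "glass", "bus" and "basis" alone.
    if lower.endswith("s") and not lower.endswith(("ss", "us", "is")):
        return lower[:-1]
    return lower
```

Rules like these handle the words in the question, but every new corner case (such as "horses" above) needs either another rule or another entry in the exception table.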

Once you can stem words, you can use a hash set to check for duplicates efficiently. Make a single pass over the words: stem each one and add the stem to the set if it is not already there. If the stem is already in the set, remove the word, since it is a duplicate. The only problem is that this does not guarantee you are removing the plural form: whichever form appears first is the one that is kept. The problem is not easy to solve without an English dictionary.
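The single pass with a hash set might look roughly like this in Python. `naive_stem` is a hypothetical placeholder that only covers the words in the question; any better stemmer can be passed in through the `stem` parameter, which is also a name chosen for this sketch.

```python
from typing import Callable, Iterable, List


def naive_stem(word: str) -> str:
    """Placeholder stemmer (an assumption): mangoes -> mango, cats -> cat."""
    if word.endswith("oes"):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def drop_duplicate_stems(
    words: Iterable[str], stem: Callable[[str], str] = naive_stem
) -> List[str]:
    """Keep the first word seen for each stem and drop later ones."""
    seen = set()
    kept = []
    for raw in words:
        word = raw.strip()          # the sample lines carry trailing spaces
        if not word:
            continue                # skip blank lines
        key = stem(word.lower())
        if key in seen:
            continue                # same stem already kept: a duplicate
        seen.add(key)
        kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["mango   ", "mangoes   ", "orange   ", "oranges   ", "cat   ", "cats"]
    print("\n".join(drop_duplicate_stems(sample)))
```

Because the first occurrence wins, a file that lists cats before cat would keep cats, which is exactly the caveat described above.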

If you want really good accuracy you'll need to use a dictionary of English words that maps singular to plural.
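A dictionary-based version could be sketched as follows. The tiny `SINGULAR_TO_PLURAL` table is a stand-in invented for this example; in practice it would be loaded from a real English word list, which is assumed to exist and is not shown here.

```python
from typing import Dict, Iterable, List

# Stand-in for a real singular -> plural dictionary (illustrative only).
SINGULAR_TO_PLURAL: Dict[str, str] = {
    "mango": "mangoes",
    "orange": "oranges",
    "cat": "cats",
    "child": "children",
}


def remove_known_plurals(
    words: Iterable[str],
    singular_to_plural: Dict[str, str] = SINGULAR_TO_PLURAL,
) -> List[str]:
    """Drop a word only if it is a known plural whose singular is also present."""
    plural_to_singular = {p: s for s, p in singular_to_plural.items()}
    cleaned = [w.strip() for w in words if w.strip()]
    present = {w.lower() for w in cleaned}
    result = []
    for word in cleaned:
        singular = plural_to_singular.get(word.lower())
        if singular is not None and singular in present:
            continue                # plural of a word that is in the file
        result.append(word)
    return result
```

Unlike the stem-and-set approach, this does not depend on the order of the lines, and a plural whose singular is missing from the file (for example children without child) is left in place.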

Upvotes: 7

Fabricator

Reputation: 12782

If you just want to filter out lines ending with s:

grep -v 's[[:space:]]*$' file.txt > newfile.txt

Upvotes: -1
