takashi88

Reputation: 37

Separating compound nouns from basic nouns part 2

I asked a related question earlier and got the answer I wanted, but I now have a follow-up question.

I have a list that goes like this:

name = ['road', 'roadwork', 'pill', 'pillbox', 'pillow', 'ball',
'football', 'basketball', 'work', 'box', 'foot', 'basket']

The code below separates the compound nouns from the base words:

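for candidate in name:
    for word in name:
        # Any other list entry contained in candidate marks it as a compound
        if word != candidate and word in candidate:
            break
    else:
        # for/else: runs only when the inner loop finished without a break
        print(candidate)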

However, I realise that the code is too restrictive, because it also removes "pillow" from the list: "pill" is a substring of "pillow", even though "pillow" is not a compound noun.

Is there code that can generate the outcome below?

name = ['road', 'pill', 'pillow', 'ball', 'work', 'box', 'foot', 'basket']

Upvotes: 1

Views: 356

Answers (2)

Kevin

Reputation: 76234

For your average word, the simplest way to determine whether it is a compound word is to chop it in two and see if both pieces are words. You have to test repeatedly with different chopping points, so the number of checks is proportional to the length of the word. It should be reasonably fast for any English word, other than 189,000-character chemical names.

words = ['road', 'roadwork', 'pill', 'pillbox', 'pillow', 'ball', 'football', 'basketball', 'work', 'box', 'foot', 'basket']

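# A set gives fast membership tests
wordSet = set(words)

def isWord(w):
    return w in wordSet

def isCompoundWord(word):
    # Try every split point that leaves both pieces non-empty
    for idx in range(1, len(word)):
        left = word[:idx]
        right = word[idx:]
        if isWord(left) and isWord(right):
            return True
    return False

# Keep only the words that cannot be split into two known words
nonCompoundWords = [word for word in words if not isCompoundWord(word)]
print(nonCompoundWords)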

output:

['road', 'pill', 'pillow', 'ball', 'work', 'box', 'foot', 'basket']
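
To see why "pillow" survives, here is a minimal sketch of the same split test that reports which pair of words matched. The names `find_split` and `vocabulary` are made up for this illustration and are not part of the code above.

```python
def find_split(word, vocabulary):
    """Return the first (left, right) pair of known words that spell word, or None."""
    for idx in range(1, len(word)):
        left, right = word[:idx], word[idx:]
        if left in vocabulary and right in vocabulary:
            return left, right
    return None

# Hypothetical vocabulary: the same words as in the question
vocabulary = {'road', 'roadwork', 'pill', 'pillbox', 'pillow', 'ball',
              'football', 'basketball', 'work', 'box', 'foot', 'basket'}

for word in ('pillbox', 'pillow', 'basketball'):
    print('%s -> %s' % (word, find_split(word, vocabulary)))
```

"pillbox" splits into "pill" and "box", but the only split of "pillow" that starts with a known word leaves "ow", which is not in the list, so it is kept.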

Upvotes: 1

mbowden

Reputation: 714

You will need to check whether what remains of the word after removing the match is itself another word. I imagine there will be situations where the etymology doesn't match up: for example, words that consist of another word plus 'is', where 'is' is not used with its usual meaning.

Edit: for example:

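words = ['book', 'store', 'bookstore', 'booking']
li = []
for word in words:
    for test in words:
        # test must be a proper prefix, so that the slice below is the true remainder
        if test != word and word.startswith(test):
            temp = word[len(test):]
            # word is a compound if the remainder is also in the list
            if temp in words and word not in li:
                li.append(word)

# Remove the compounds after the scan, so the list is not changed while iterating
for x in li:
    words.remove(x)
print(words)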
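
The etymology caveat can be shown with a small sketch. The word list, the helper `splits_into_known_words`, and the allow-list `known_simple` are all hypothetical and only meant to illustrate the idea: a purely string-based test flags "carpet" and "axis" as compounds, and one possible workaround is a hand-maintained list of exceptions.

```python
def splits_into_known_words(word, vocabulary):
    """True if word can be cut into two pieces that are both in vocabulary."""
    return any(word[:i] in vocabulary and word[i:] in vocabulary
               for i in range(1, len(word)))

# Hypothetical vocabulary chosen to trigger false positives
vocabulary = {'car', 'pet', 'carpet', 'ax', 'is', 'axis'}

# Neither word is a real compound, but both pass the string test
flagged = sorted(w for w in vocabulary if splits_into_known_words(w, vocabulary))
print(flagged)

# Assumed workaround: skip words that are known not to be compounds
known_simple = {'carpet', 'axis'}
compounds = [w for w in flagged if w not in known_simple]
print(compounds)
```

Whether an exception list is practical depends on how large and how noisy the word list is.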

Upvotes: 0
