prabhu
prabhu

Reputation: 929

How to remove list of words from a list of strings

Sorry if the question is bit confusing. This is similar to this question

I think this the above question is close to what I want, but in Clojure.

There is another question

I need something like this but instead of '[br]' in that question, there is a list of strings that need to be searched and removed.

Hope I made myself clear.

I think that this is due to the fact that strings in python are immutable.

I have a list of noise words that need to be removed from a list of strings.

If I use the list comprehension, I end up searching the same string again and again. So, only "of" gets removed and not "the". So my modified list looks like this

places = ['New York', 'the New York City', 'at Moscow' and many more]

noise_words_list = ['of', 'the', 'in', 'for', 'at']

for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

I would like to know as to what mistake I'm doing.

Upvotes: 11

Views: 19287

Answers (4)

Tony Veijalainen
Tony Veijalainen

Reputation: 5555

Without regexp you could do like this:

places = ['of New York', 'of the New York']

noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places
         ]
print stuff

Upvotes: 15

Manoj Govindan
Manoj Govindan

Reputation: 74705

Here is my stab at it. This uses regular expressions.

import re
pattern = re.compile("(of|the|in|for|at)\W", re.I)
phrases = ['of New York', 'of the New York']
map(lambda phrase: pattern.sub("", phrase),  phrases) # ['New York', 'New York']

Sans lambda:

[pattern.sub("", phrase) for phrase in phrases]

Update

Fix for the bug pointed out by gnibbler (thanks!):

pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']

@prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify run both versions of the regular expressions against the phrase "Spain has rain".

Upvotes: 11

John La Rooy
John La Rooy

Reputation: 304147

>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']

Upvotes: 4

wds
wds

Reputation: 32283

Since you would like to know what you are doing wrong, this line:

stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]

takes place, and then begins to loop over words. First it checks for "of". Your place (e.g. "of the New York") is checked to see if it starts with "of". It is transformed (call to replace and strip) and added to the result list. The crucial thing here is that result is never examined again. For every word you iterate over in the comprehension, a new result is added to the result list. So the next word is "the" and your place ("of the New York") doesn't start with "the", so no new result is added.

I assume the result you got eventually is the concatenation of your place variables. A simpler to read and understand procedural version would be (untested):

results = []
for place in places:
    for word in words:
        if place.startswith(word):
            place = place.replace(word, "").strip()
    results.append(place)

Keep in mind that replace() will remove the word anywhere in the string, even if it occurs as a simple substring. You can avoid this by using regexes with a pattern something like ^the\b.

Upvotes: 1

Related Questions