Deleting words from text list

Question

I am trying to remove certain words (in addition to the usual stopwords) from a list of text strings, but it is not working for some reason:

documents = ["Human machine interface for lab abc computer applications",
         "A survey of user opinion of computer system response time",
         "The EPS user interface management system",
         "System and human system engineering testing of EPS",
         "Relation of user perceived response time to error measurement",
         "The generation of random binary unordered trees",
         "The intersection graph of paths in trees",
         "Graph minors IV Widths of trees and well quasi ordering",
         "Graph minors A survey"]

exclude = ['am', 'there','here', 'for', 'of', 'user']

new_doc = [word for word in documents if word not in exclude]

print new_doc

OUTPUT

['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']

As you can see, none of the words in exclude are removed from documents ("for" is a prime example).

It works with this expression:

new_doc = [word for word in str(documents).split() if word not in exclude]

But then how do I get back the original elements of documents (albeit "cleaned" ones)?

I will greatly appreciate your help!

Kasravnd · Accepted Answer

You are looping over the sentences, not the words, so each whole sentence is compared against exclude and never matches. Instead, split each sentence into words, filter the words with a nested loop, then join what remains back into a sentence:

>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>> 
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
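For readers who find the nested comprehension dense, the same split, filter, and join steps can be written out as an explicit loop. This is only a sketch: the helper name remove_words is made up for illustration, the document list is shortened, and exclude is turned into a set purely as an assumption that faster membership tests are wanted.

```python
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "Graph minors A survey",
]

# A set gives constant-time membership tests; a list works too.
exclude = {'am', 'there', 'here', 'for', 'of', 'user'}


def remove_words(sentence, banned):
    """Return the sentence with every whitespace-separated word in banned dropped."""
    kept = []
    for word in sentence.split():
        if word not in banned:
            kept.append(word)
    # Joining with a single space also normalises the spacing.
    return ' '.join(kept)


new_doc = [remove_words(sent, exclude) for sent in documents]
print(new_doc)
```

The comparison is case-sensitive, so "Of" or "User" would be kept; comparing word.lower() instead would change that, if it is the behaviour you want.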

Alternatively, instead of splitting and filtering in a nested list comprehension, you can use a regex and replace the excluded words with an empty string using re.sub:

>>> import re
>>> 
>>> new_doc = [re.sub(r'|'.join(exclude),'',sent) for sent in documents]
>>> new_doc
['Human machine interface  lab abc computer applications', 'A survey   opinion  computer system response time', 'The EPS  interface management system', 'System and human system engineering testing  EPS', 'Relation   perceived response time to error measurement', 'The generation  random binary unordered trees', 'The intersection graph  paths in trees', 'Graph minors IV Widths  trees and well quasi ordering', 'Graph minors A survey']
>>>

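r'|'.join(exclude) concatenates the words with a pipe (|), which means logical OR in regex. Note that this pattern matches anywhere in the string, including inside longer words, and it leaves the surrounding spaces in place, which is why the output above contains double spaces.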
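If matching inside longer words is a concern, one possible variant wraps the alternation in word boundaries (\b) and escapes each word with re.escape. This is a sketch rather than part of the original answer: the names pattern and clean are illustrative, and the second sample sentence was added only to show words that contain an excluded word as a substring.

```python
import re

documents = [
    "The EPS user interface management system",
    "Sample information for users",  # extra sentence, added for illustration
]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# \b anchors each alternative at word boundaries, so 'for' no longer
# matches inside 'information' and 'user' no longer matches inside 'users'.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')


def clean(sentence):
    """Remove whole excluded words, then collapse the leftover whitespace."""
    stripped = pattern.sub('', sentence)
    return ' '.join(stripped.split())


new_doc = [clean(sent) for sent in documents]
print(new_doc)
```

The final split and join collapses the gaps left behind by the removed words, so the result has single spaces like the list-comprehension version.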
