Reputation: 3169
I am trying to remove certain words (in addition to using stopwords) from a list of text strings, but it is not working for some reason.
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
exclude = ['am', 'there','here', 'for', 'of', 'user']
new_doc = [word for word in documents if word not in exclude]
print new_doc
OUTPUT
['Human machine interface for lab abc computer applications', 'A survey of user opinion of computer system response time', 'The EPS user interface management system', 'System and human system engineering testing of EPS', 'Relation of user perceived response time to error measurement', 'The generation of random binary unordered trees', 'The intersection graph of paths in trees', 'Graph minors IV Widths of trees and well quasi ordering', 'Graph minors A survey']
As you can see, no words in EXCLUDE are removed from DOCUMENTS ("for" is a prime example).
It works with this expression:
new_doc = [word for word in str(documents).split() if word not in exclude]
But then how do I get back the initial elements of DOCUMENTS (albeit "cleaned" ones)?
I will greatly appreciate your help!
Upvotes: 3
Views: 87
Reputation: 107287
You are looping over the sentences, not the words. To filter out words, you need to split each sentence, use a nested loop over its words to filter them, and then join the result.
>>> new_doc = [' '.join([word for word in sent.split() if word not in exclude]) for sent in documents]
>>>
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
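The split, filter, and join steps above can also be wrapped in a small helper. This is only a sketch: the function name remove_words is hypothetical, matching is assumed to be case-sensitive (as in the one-liner), and the exclusion list is converted to a set purely to make the membership test cheaper.

```python
def remove_words(sentences, exclude):
    """Return a new list of sentences with every word in `exclude` dropped."""
    excluded = set(exclude)  # set membership tests are O(1) on average
    cleaned = []
    for sentence in sentences:
        # Split on whitespace, keep the words that are not excluded, rejoin.
        kept = [word for word in sentence.split() if word not in excluded]
        cleaned.append(' '.join(kept))
    return cleaned


documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time"]
exclude = ['am', 'there', 'here', 'for', 'of', 'user']

new_doc = remove_words(documents, exclude)
print(new_doc)
```

Each element of new_doc lines up with the element of documents at the same index, so the original list structure is preserved.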
Also, instead of splitting and filtering in a nested list comprehension, you can use a regex to replace the excluded words with an empty string via the re.sub function:
>>> import re
>>>
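>>> # \b restricts matches to whole words; \s* also removes the space left behind
>>> new_doc = [re.sub(r'\b(?:' + r'|'.join(exclude) + r')\b\s*', '', sent) for sent in documents]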
>>> new_doc
['Human machine interface lab abc computer applications', 'A survey opinion computer system response time', 'The EPS interface management system', 'System and human system engineering testing EPS', 'Relation perceived response time to error measurement', 'The generation random binary unordered trees', 'The intersection graph paths in trees', 'Graph minors IV Widths trees and well quasi ordering', 'Graph minors A survey']
>>>
r'|'.join(exclude)
will join the words with a pipe (|), which means logical OR in regex.
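One caveat with an alternation pattern: a bare pattern such as am|for|of also matches inside longer words ("former", "software") and leaves the surrounding spaces in place. The following is a minimal sketch of a stricter variant, assuming whole-word removal is what is wanted; the helper name strip_words is hypothetical.

```python
import re

exclude = ['am', 'there', 'here', 'for', 'of', 'user']

# \b anchors each alternative to word boundaries; re.escape guards against
# regex metacharacters that might appear in the word list.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b')


def strip_words(sentence):
    # Remove whole-word matches, then collapse the leftover runs of spaces.
    without_words = pattern.sub('', sentence)
    return ' '.join(without_words.split())


print(strip_words("Human machine interface for lab abc computer applications"))
print(strip_words("the former user amended the software"))
```

In the second call, "former", "amended" and "software" are left intact because the excluded words only occur inside them, not as whole words.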
Upvotes: 1
Reputation: 10135
You should split the lines into words before filtering them:
new_doc = [' '.join([word for word in line.split() if word not in exclude]) for line in documents]
Upvotes: 3