Reputation: 2185
I have a list c which has 353000 elements. Each element is a parsed string. A sample of this list is:
print c[25:50]
['aluminum co of america', 'aluminum co of america', 'aluminum co of america', 'aluminum company of america', 'aluminum company of america', 'aluminum co of america', 'aluminum company of america', 'aluminum company of america', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'asset acceptance capital corp.', 'ace cash express, inc.', 'ace cash express, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.', 'airtran holdings, inc.']
I counted the frequency of words in the list:
from collections import Counter
r=[]
for e in c:
    r.extend(e.split())
count=Counter(r)
So the six most frequent words in the list are:
{'inc.': 18670, 'corporation': 9255, 'company': 2632, 'group,': 1190, '&': 1158, 'financial': 1025}
I would like to remove these words from the elements of my list. For example, if I have "aluminum corporation of america", the output should be "aluminum of america". How can I do this?
Upvotes: 1
Views: 224
Reputation: 1190
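You could use regular expressions to replace the words you want to delete with an empty string:
import re
# Build the pattern from the six most common words only, escaping characters such as "."
p = re.compile('|'.join(re.escape(word) + ' ?' for word, _ in count.most_common(6)))
cleaned = [p.sub('', item) for item in c]
Edit: the . in words such as inc. has to be escaped in the regex, which re.escape takes care of, so the pattern is a bit more complex than a plain join. Note that this pattern also matches the words when they occur inside longer words.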
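A minimal sketch of a stricter variant, assuming the words should only be removed when they appear as whole whitespace-delimited tokens. The helper names build_pattern and remove_words, the stop_words list, and the sample strings are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical sample standing in for the real 353000-element list c
c = [
    "aluminum corporation of america",
    "ace cash express, inc.",
    "zinc. company holdings",
]

# Hypothetical list of words to drop (in practice, taken from count.most_common)
stop_words = ["inc.", "corporation", "company"]

def build_pattern(words):
    """Compile a regex matching any of `words` as a whole whitespace-delimited token."""
    # re.escape neutralises metacharacters such as "." in "inc."
    alternatives = "|".join(re.escape(w) for w in words)
    # (?<!\S) and (?!\S) require whitespace or a string boundary on both sides
    return re.compile(r"(?<!\S)(?:" + alternatives + r")(?!\S)")

def remove_words(text, pattern):
    """Delete matching tokens, then collapse the leftover whitespace."""
    return " ".join(pattern.sub("", text).split())

pattern = build_pattern(stop_words)
cleaned = [remove_words(name, pattern) for name in c]
print(cleaned)
```

Because of the lookarounds, "zinc." is left alone even though it contains "inc.", and the final split/join removes the double spaces left behind by the deletions.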
Upvotes: 1
Reputation: 239453
# Using Generator Expression with `Counter` to speed it up a little bit
from collections import Counter
count = Counter(item for e in c for item in e.split())
# Get most frequently used words
words = {item for item, cnt in count.most_common(6)}
# Drop those words from each string in `c` and rebuild the strings
cleaned = [" ".join(item for item in e.split() if item not in words) for e in c]
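To check the approach on a small input, the same steps can be wrapped in a function. This is only a sketch: the function name remove_most_common, its parameter n, and the sample names are made up for illustration.

```python
from collections import Counter

def remove_most_common(names, n):
    """Remove the n most frequent whitespace-separated words from every string."""
    count = Counter(word for name in names for word in name.split())
    # A set gives constant-time membership tests while filtering
    words = {word for word, _ in count.most_common(n)}
    return [" ".join(w for w in name.split() if w not in words) for name in names]

# Hypothetical sample: "inc." appears three times, "corporation" twice
names = [
    "aluminum corporation of america",
    "asset acceptance corporation",
    "ace cash express, inc.",
    "airtran holdings, inc.",
    "alpha group, inc.",
]

result = remove_most_common(names, 2)
print(result)
```

Since the strings are split on whitespace only, punctuation stays attached to the words: "group," and "group" count as different words, and a trailing comma such as the one in "express," remains after "inc." is removed.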
Upvotes: 1