Reputation: 63
I'm working on a spam filter, and some of the emails in my files are in HTML, so they contain parts such as:
br></font><br><br><br><br><br><br><br><br><br><br><br><br><br=
><br><br><br></font></p></center></center></tr></tbody></table></center></=
center></center></center></center></body></html>
I'm currently skipping them like this:
if word[0] == '<' or word[len(word)-1] == '>':
But some of these fragments still end up in my dictionary. I searched for a way to ignore these words, but with no success. Is there a Python library that solves this problem, or does anybody know a more efficient way to code it?
Right now I read the words like this:
mail_words = {}
with open(email, 'r', encoding='utf-8') as file:
    text_of_mail = file.read()
words = text_of_mail.split()
words = [w.translate(str.maketrans("", "", "0123456789”#%&\’()*+,-./:;=?@[\\]^_`{|}~’")) for w in words]
for word in words:
    if word == '' or word == ' ' or word == '\n' or word[0] == '<' or word[len(word)-1] == '>':
        pass
    elif word not in mail_words:
        mail_words[word] = 1
    else:
        mail_words[word] += 1
Any help is appreciated.
Upvotes: 1
Views: 37
Reputation: 51643
Instead of using maketrans, use the built-in lightweight HTML parser:
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Adjusted from https://docs.python.org/3/library/html.parser.html"""

    def __init__(self):
        super().__init__()
        # per-instance set, so separate parser objects do not share collected data
        self.data_set = set()

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)
        self.data_set.add(data)

parser = MyHTMLParser()
# well formed html example
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
print(parser.data_set)
Output:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
{'Test', 'Parse me!'} # parser.data_set content
You would use it like so:
parser = MyHTMLParser()
with open(email, 'r', encoding='utf-8') as file:
    parser.feed(file.read())
print(parser.data_set)
You can then post-process the resulting set, for example:
# remove entries that consist purely of whitespace (\t, \n, etc.)
cleaned = {a.strip() for a in parser.data_set if a.strip()}
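Since the question wants word counts rather than a set of text chunks, here is a minimal sketch of how the same parser idea could feed a counter. The names `TextExtractor`, `count_words` and the `quoted_printable` flag are made up for this example. The `=` at the end of the lines in the question's sample looks like quoted-printable soft line breaks; that is only an assumption, so the decoding step is optional. Lowercasing and keeping letters only are also my choices, not something the question asks for.

```python
import quopri
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags (hypothetical helper, not a library class)."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore CSS/JavaScript source, keep everything else.
        if not self._skip_depth:
            self.chunks.append(data)


def count_words(raw, quoted_printable=False):
    """Return a Counter of lowercase words found in the text parts of raw."""
    if quoted_printable:
        # Assumption: the body is quoted-printable encoded.
        # Decoding joins lines that were split with a trailing "=".
        raw = quopri.decodestring(raw.encode("utf-8")).decode("utf-8", errors="replace")
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.chunks)
    # Keep runs of letters only; digits, underscores and punctuation are dropped.
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


if __name__ == "__main__":
    sample = "<html><body><p>Buy now<br=\n><br>buy cheap</p></body></html>"
    print(count_words(sample, quoted_printable=True))
```

A `Counter` behaves like the `mail_words` dictionary from the question, so `counts[word]` gives the frequency directly. If your files hold complete raw messages with headers, the standard `email` package is probably a better way to get at the decoded body than calling `quopri` on the whole file.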
Upvotes: 1