Juraj

Reputation: 63

Getting rid of specific words in file

I'm working on a spam filter, and some of the emails in my files are in HTML, so they contain parts such as:

br></font><br><br><br><br><br><br><br><br><br><br><br><br><br=
><br><br><br></font></p></center></center></tr></tbody></table></center></=
center></center></center></center></body></html>

I'm ignoring them this way:

if word[0] == '<' or word[len(word)-1] == '>':

But some parts still make it into my dictionary. I searched for a way to ignore these words, but with no success. Is there a Python library that solves this problem, or does anybody know a more efficient way to code it?

Right now I read the words like this:

mail_words = {}
with open(email, 'r', encoding='utf-8') as file:
    text_of_mail = file.read()
    words = text_of_mail.split()
    words = [w.translate(str.maketrans("", "", "0123456789”#%&\’()*+,-./:;=?@[\\]^_`{|}~’")) for w in words]

for word in words:
    # skip empty strings and anything that starts or ends like an HTML tag
    if word == '' or word == ' ' or word == '\n' or word[0] == '<' or word[len(word)-1] == '>':
        pass
    elif word not in mail_words:
        mail_words[word] = 1
    else:
        mail_words[word] += 1

I'd appreciate any help.

Upvotes: 1

Views: 37

Answers (1)

Patrick Artner

Reputation: 51643

Instead of using maketrans, use the built-in lightweight HTML parser:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    """Adjusted from https://docs.python.org/3/library/html.parser.html"""
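    def __init__(self):
        super().__init__()
        # per-instance set, so separate parser objects do not share collected text
        self.data_set = set()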

    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data  :", data)
        self.data_set.add(data)


parser = MyHTMLParser()

# well-formed HTML example
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

print(parser.data_set)

Output:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

{'Test', 'Parse me!'}       # parser.data_set  content

You would use it like so:

parser = MyHTMLParser()
with open(email, 'r', encoding='utf-8') as file:
    parser.feed(file.read())
print(parser.data_set)
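
One caveat about the sample in the question: the `=` at the ends of its lines looks like quoted-printable soft line breaks. That is an assumption about how the files were saved, not something the question confirms. If it holds, decoding with the standard `quopri` module before feeding the parser rejoins words and tags that were split across lines. A minimal sketch, where `TextOnly` and the `raw` sample are made up for illustration:

```python
import quopri
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: collects the text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Made-up sample: "=\n" is a quoted-printable soft line break.
raw = b'<center>Hello wor=\nld</center><br=\n><br>'

# Assumption: the body is quoted-printable encoded UTF-8 text.
decoded = quopri.decodestring(raw).decode('utf-8', errors='replace')

parser = TextOnly()
parser.feed(decoded)
parser.close()  # flush any text still buffered by the parser
text = ''.join(parser.chunks)
print(text)
```

If the files are not quoted-printable, skip the decoding step, since it would also rewrite any legitimate `=XX` sequences in the text.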

You then postprocess the resulting set, for example by:

# remove entries consisting purely of whitespace (\t, \n, etc.)
cleaned = {a.strip() for a in parser.data_set if a.strip()}
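
A set drops duplicates, so it cannot give the per-word counts that the dictionary in the question builds. As a rough sketch of one way to get them, the variant below collects the text chunks in a list and counts words with `collections.Counter`. The names `TextCollector` and `count_words`, the lower-casing, and the punctuation stripping are illustrative choices, not part of the parser API or the question's code:

```python
import string
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps only the text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []  # a list, so repeated text is kept for counting

    def handle_data(self, data):
        self.chunks.append(data)


def count_words(html_text):
    """Return a Counter of lower-cased words found outside HTML tags."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()  # flush any text still buffered by the parser
    counts = Counter()
    for chunk in collector.chunks:
        for token in chunk.split():
            # strip leading/trailing punctuation and digits from each token
            word = token.strip(string.punctuation + string.digits).lower()
            if word:
                counts[word] += 1
    return counts


sample = '<p>Buy now!</p><br><br><center>buy cheap <b>now</b></center>'
print(count_words(sample))
```

For the made-up sample this counts `buy` and `now` twice each and `cheap` once, and no tag names end up in the result.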

Upvotes: 1
