Ila

Reputation: 3538

Parse each file in a directory with BeautifulSoup/Python, save out as new file

New to Python & BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup action on it, then runs a Bleach action on it, then saves the result as file "example-cleaned.html". So far it is working for all contents of "example.html".

I need to modify it so that it opens each file in folder "/posts/", runs the program on it, then saves it out as "/posts-cleaned/X-cleaned.html" where X is the original filename.

Here's my code, minimised:

from bs4 import BeautifulSoup
import bleach
import io
import re

# Binary mode lets BeautifulSoup detect the file's encoding itself.
text = BeautifulSoup(open("posts/example.html", "rb"), "html.parser")

tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}

# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
for s in text(tag_black_list):
    s.decompose()
pretty = text.prettify()

# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

# io.open with an explicit encoding takes the cleaned text directly.
fout = io.open("posts/example-cleaned.html", "w", encoding="utf-8")
fout.write(cleaned)
fout.close()
print("Done")

Assistance & pointers to existing solutions gladly received!

Upvotes: 5

Views: 4795

Answers (2)

NullUserException

Reputation: 85468

You can use os.listdir() to get a list of all files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk().
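As a rough sketch of the difference between the two: os.listdir() returns only the bare names of the entries directly inside a directory (subdirectories included), while os.walk() visits every directory below it. The helper names list_top_level_files and list_all_files are made up for this illustration.

```python
import os


def list_top_level_files(directory):
    """Return full paths of regular files directly inside `directory`."""
    paths = []
    for name in sorted(os.listdir(directory)):
        # os.listdir() yields bare names, so join them back onto the directory.
        path = os.path.join(directory, name)
        if os.path.isfile(path):  # skip subdirectories
            paths.append(path)
    return paths


def list_all_files(directory):
    """Return full paths of regular files anywhere below `directory`."""
    paths = []
    # os.walk() yields one (dirpath, dirnames, filenames) tuple per directory.
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```

For a flat folder such as "posts/", list_top_level_files("posts") should be enough; list_all_files only matters if the posts are nested in subfolders.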

I would move all the code that handles a single file into a function, and then write a second function that handles the whole directory. Something like this:

import io
import os

import bleach
from bs4 import BeautifulSoup

def clean_dir(directory):

    for filename in os.listdir(directory):
        # os.listdir() returns bare names, so join them back onto the directory
        path = os.path.join(directory, filename)
        if os.path.isfile(path):
            clean_file(path)

def clean_file(filename):

    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p','div']
    attr_white_list = {'*': ['title']}

    # binary mode lets BeautifulSoup detect the file's encoding itself
    with open(filename, 'rb') as fhandle:
        text = BeautifulSoup(fhandle, 'html.parser')

    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    # this appends -cleaned to the name, before the extension if there is one
    base, ext = os.path.splitext(filename)
    cleaned_filename = '{0}-cleaned{1}'.format(base, ext)

    with io.open(cleaned_filename, 'w', encoding='utf-8') as fout:
        fout.write(cleaned)

    print("Done")

Then you just call clean_dir('/posts') or what not.

I'm appending "-cleaned" to the files, but I think I like your idea of using a whole new directory better. That way you won't have to handle conflicts if -cleaned already exists for some file, etc.

I'm also using the with statement to open files here, since it closes them automatically, even if an exception is raised inside the block.
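To illustrate what the with statement guarantees, here is a small sketch: the file is closed when the block exits, including when an exception is raised inside it, although the exception itself still propagates and has to be caught separately. The function name write_then_fail and the simulated error are invented for this demonstration.

```python
def write_then_fail(path):
    """Write to `path`, fail on purpose, and return the file object."""
    handle = None
    try:
        with open(path, "w") as fout:
            handle = fout
            fout.write("partial")
            # Simulated failure, standing in for an error during cleaning.
            raise ValueError("simulated failure")
    except ValueError:
        # `with` closed the file but did not swallow the exception,
        # so it still has to be handled here.
        pass
    return handle
```

Checking handle.closed on the returned object shows that the file was closed despite the error.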

Upvotes: 5

Ila

Reputation: 3538

Answer to my own question, for others who might find the Python docs for os.listdir a bit unhelpful:

from bs4 import BeautifulSoup
import bleach
import io
import re
import os

tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}

postlist = os.listdir("posts/")

for post in postlist:

    # HERE: you need to specify the directory again, the value of "post" is just the filename:
    # (binary mode lets BeautifulSoup detect the file's encoding itself)
    text = BeautifulSoup(open(os.path.join("posts", post), "rb"), "html.parser")

    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    for s in text(tag_black_list):
        s.decompose()
    pretty = text.prettify()

    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip=True, attributes=attr_white_list, tags=tag_white_list)

    # io.open with an explicit encoding takes the cleaned text directly.
    fout = io.open(os.path.join("posts-cleaned", post), "w", encoding="utf-8")
    fout.write(cleaned)
    fout.close()

I cheated and made a separate folder called "posts-cleaned/" because saving files there was easier than splitting the filename, adding "-cleaned", and re-joining it. If anyone wants to show me a good way to do that, that would be even better.
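One possible way to do the split-and-rejoin, as a sketch rather than a definitive answer: os.path.splitext() separates the extension from the rest of the name, so "-cleaned" can be slotted in between. The helper name cleaned_path is hypothetical, and the output folder is assumed to exist already.

```python
import os


def cleaned_path(source_path, output_dir):
    """Build the output path for a cleaned copy of `source_path`.

    For example, 'example.html' inside 'posts' maps to
    'example-cleaned.html' inside `output_dir`.
    """
    # Drop the directory part, then split 'example.html'
    # into ('example', '.html').
    stem, extension = os.path.splitext(os.path.basename(source_path))
    return os.path.join(output_dir, stem + "-cleaned" + extension)
```

Inside the loop this could replace the hard-coded output name, e.g. cleaned_path(os.path.join("posts", post), "posts-cleaned"). A file without an extension simply gets "-cleaned" appended.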

Upvotes: 2
