Reputation: 3538
New to Python & BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup action on it, then runs a Bleach action on it, then saves the result as file "example-cleaned.html". So far it is working for all contents of "example.html".
I need to modify it so that it opens each file in folder "/posts/", runs the program on it, then saves it out as "/posts-cleaned/X-cleaned.html" where X is the original filename.
Here's my code, minimised:
from bs4 import BeautifulSoup
import bleach
import re
text = BeautifulSoup(open("posts/example.html"))
text.encode("utf-8")
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
fout = open("posts/example-cleaned.html", "w")
fout.write(cleaned.encode("utf-8"))
fout.close()
print "Done"
Assistance & pointers to existing solutions gladly received!
Upvotes: 5
Views: 4795
Reputation: 85468
You can use os.listdir()
to get a list of all files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk()
.
I would move all this code to handle a single file to function, and then write a second function to handle parsing the whole directory. Something like this:
def clean_dir(directory):
os.chdir(directory)
for filename in os.listdir(directory):
clean_file(filename)
def clean_file(filename):
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
with open(filename, 'r') as fhandle:
text = BeautifulSoup(fhandle)
text.encode("utf-8")
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
# this appends -cleaned to the file;
# relies on the file having a '.'
dot_pos = filename.rfind('.')
cleaned_filename = '{0}-cleaned{1}'.format(filename[:dot_pos], filename[dot_pos:])
with open(cleaned_filename, 'w') as fout:
fout.write(cleaned.encode("utf-8"))
print "Done"
Then you just call clean_dir('/posts')
or what not.
I'm appending "-cleaned" to the files, but I think I like your idea of using a whole new directory better. That way you won't have to handle conflicts if -cleaned
already exists for some file, etc.
I'm also using the with
statement to open files here as it closes them and handles exceptions automatically.
Upvotes: 5
Reputation: 3538
Answer to my own question, for others who might find the Python docs for os.listdir a bit unhelpful:
from bs4 import BeautifulSoup
import bleach
import re
import os, os.path
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
postlist = os.listdir("posts/")
for post in postlist:
# HERE: you need to specify the directory again, the value of "post" is just the filename:
text = BeautifulSoup(open("posts/"+post))
text.encode("utf-8")
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
fout = open("posts-cleaned/"+post, "w")
fout.write(cleaned.encode("utf-8"))
fout.close()
I cheated and made a separate folder called "posts-cleaned/" because savings files to there was easier than splitting the filename, adding "cleaned", and re-joining it, although if anyone wants to show me a good way to do that, that would be even better.
Upvotes: 2