Reputation: 3275
I am using the following Python / BeautifulSoup code to remove HTML elements from a text file:
from bs4 import BeautifulSoup

with open("textFileWithHtml.txt") as markup:
    soup = BeautifulSoup(markup.read())
with open("strip_textFileWithHtml.txt", "w") as f:
    f.write(soup.get_text().encode('utf-8'))
My question is: how can I apply this code to every text file in a folder (directory), producing for each one a new, processed text file with the HTML elements removed, without having to run the script by hand for every file?
Upvotes: 2
Views: 1790
Reputation: 4251
The glob module lets you list all the files in a directory that match a pattern:
import glob
from bs4 import BeautifulSoup

for path in glob.glob('*.txt'):
    with open(path) as markup:
        soup = BeautifulSoup(markup.read(), 'html.parser')
    # Open in binary mode, since get_text() is encoded to UTF-8 bytes.
    with open("strip_" + path, "wb") as f:
        f.write(soup.get_text().encode('utf-8'))
If you also want to process every subfolder recursively, check out os.walk.
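A minimal sketch of the recursive variant with os.walk (the helper name txt_files and the rule skipping already-generated strip_ files are my own additions, not part of the answer above):

```python
import os

def txt_files(root):
    """Recursively yield paths of .txt files under root, skipping strip_ output files."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.txt') and not name.startswith('strip_'):
                yield os.path.join(dirpath, name)
```

Each yielded path can then be fed into the same open/BeautifulSoup/write loop shown above.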
Upvotes: 2
Reputation: 36262
I would leave that work to the shell: replace the hardcoded input filename with the filenames passed in sys.argv, then invoke the script with a glob pattern that matches many files, like:
from bs4 import BeautifulSoup
import sys

for fi in sys.argv[1:]:
    with open(fi) as markup:
        soup = BeautifulSoup(markup.read(), 'html.parser')
    # Open in binary mode, since get_text() is encoded to UTF-8 bytes.
    with open("strip_" + fi, "wb") as f:
        f.write(soup.get_text().encode('utf-8'))
And run it like:
python script.py *.txt
The shell expands *.txt into the list of matching filenames before the script starts. (Windows cmd.exe does not expand globs, so there you would fall back to the glob module inside the script.)
Upvotes: 1