Reputation: 2503
I have a Python script:
$ cat ~/script.py
import sys
from lxml import etree
from lxml.html import parse
doc = parse(sys.argv[1])
title = doc.find('//title')
title.text = span2.text.strip()
print etree.tostring(doc)
I can run the script on an individual file by issuing something like
$ python script.py foo.html > new-foo.html
My problem is that I have a directory ~/webpage that contains hundreds of .html files scattered throughout sub-directories. I would like to run ~/script.py on all of these html files. I am currently doing this with
$ find ~/webpage/ -name "*.html" -exec sh -c 'python ~/script.py {} > {}-new' \;
However, this creates a new file for each html file in ~/webpage and I actually want the original file edited.
Is this possible to do from within Python? Maybe with something like os.walk?
Upvotes: 2
Views: 898
Reputation: 918
The os module in Python has a function specifically for walking directory trees, os.walk. From its documentation:
Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
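import os
import sys
from lxml import etree
from lxml.html import parse

def parse_file(file_name):
    doc = parse(file_name)
    title = doc.find('//title')
    # NOTE: span2 is carried over from the question's script, where it is
    # never defined; it must be assigned before this line will run.
    title.text = span2.text.strip()
    print etree.tostring(doc)

for root, dirs, files in os.walk('/path/to/webpages'):
    for name in files:
        # Only hand .html files to the parser.
        if name.endswith('.html'):
            parse_file(os.path.join(root, name))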
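The loop above prints each modified document instead of changing the file, while the question asks for the original to be edited. Here is a minimal sketch of one way to do that, assuming Python 3 (for os.replace) and UTF-8 files. The names edit_in_place and edit_tree are hypothetical, and the transform argument is a stand-in for the lxml logic; none of this is from the original answer.

```python
import os
import shutil
import tempfile


def edit_in_place(path, transform):
    """Replace the contents of `path` with transform(old_contents)."""
    # Assumption: the files are UTF-8 text.
    with open(path, encoding="utf-8") as src:
        new_text = transform(src.read())
    # Write to a temporary file in the same directory and then swap it in,
    # so a failure mid-write does not leave a truncated original.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as dst:
            dst.write(new_text)
        shutil.copymode(path, tmp_path)   # keep the original permission bits
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise


def edit_tree(top, transform, suffix=".html"):
    """Apply edit_in_place to every file under `top` whose name ends in `suffix`."""
    edited = []
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.endswith(suffix):
                path = os.path.join(root, name)
                edit_in_place(path, transform)
                edited.append(path)
    return edited
```

To use it with the script from the question, transform would be a function that takes the HTML text, applies the title change, and returns the serialized result. For example, edit_tree('/path/to/webpages', fix_title), where fix_title is a name chosen here for illustration.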
Upvotes: 2
Reputation: 212
import os

def process(file_name):
    with open(file_name) as readonly_file:
        print "Do something with %s, size %d" % (file_name, len(readonly_file.read()))

def traverse(directory, callback=process):
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            path = os.path.abspath(os.path.join(dirpath, f))
            callback(path)

traverse('./')
Rewrite the process function with your own logic; the callback receives the file's absolute path as its only parameter.
If you want to process only files of a specific type:
def traverse(directory, callback=process, file_type="txt"):
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            path = os.path.abspath(os.path.join(dirpath, f))
            if path.endswith(file_type):
                callback(path)
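One caveat with path.endswith(file_type): it compares raw string suffixes, so file_type="txt" also matches a name such as notes.mytxt. A possible stricter filter compares the extension from os.path.splitext instead. This is only a sketch, and the names has_extension and iter_files are illustrative rather than part of the answer.

```python
import os


def has_extension(path, file_type):
    """True if the extension of `path` equals `file_type` (case-insensitive, dot optional)."""
    ext = os.path.splitext(path)[1]   # e.g. ".HTML", or "" when there is no extension
    return ext.lower().lstrip(".") == file_type.lower().lstrip(".")


def iter_files(directory, file_type):
    """Yield absolute paths of files under `directory` that have the given extension."""
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            if has_extension(f, file_type):
                yield os.path.abspath(os.path.join(dirpath, f))
```

A caller could then write: for path in iter_files('./', 'html'): process(path).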
Upvotes: 2