Reputation: 33
I have used HTTRACK to download Federal regulations from a government website, and the resulting HTML files are not intuitively named. Each file has a <TITLE></TITLE>
tag set, that would serve nicely to name each file in a fashion that will lend itself to ebook creation. I want to turn these regulations into an ebook for my Kindle, so that I can have the regulations readily available for reference, rather than having to carry volumes of books with me everywhere.
My preferred text/hex editor, UltraEdit Professional 15.20.0.1026, has scripting commands enable through embedding of the JavaScript engine. In researching possible solutions to my problem, I found xmlTitleSave on the IDM UltraEdit website.
// ----------------------------------------------------------------------------
// Script Name: xmlTitleSave.js
// Creation Date: 2008-06-09
// Last Modified:
// Copyright: none
// Purpose: find the <title> value in an XML document, then saves the file as the
// title.xml in a user-specified directory
// ----------------------------------------------------------------------------
//Some variables we need
var regex = "<title>(.*)</title>" //Perl regular expression to find title string
var file_path = UltraEdit.getString("Path to save file at? !! MUST PRE EXIST !!",1);
// Start at the beginning of the file
UltraEdit.activeDocument.top();
UltraEdit.activeDocument.unicodeToASCII();
// Turn on regular expressions
UltraEdit.activeDocument.findReplace.regExp = true;
// Find it
UltraEdit.activeDocument.findReplace.find(regex);
// Load it into a selection
var titl = UltraEdit.activeDocument.selection;
// Javascript function 'match' will match the regex within the javascript engine
// so we can extract the actual title via array
t = titl.match(regex);
// 't' is an array of the match from 'titl' based on the var 'regex'
// the 2nd value of the array gives us what we need... then append '.xml'
saveTitle = t[1]+".xml";
UltraEdit.saveAs(file_path + saveTitle);
// Uncomment for debugging
// UltraEdit.outputWindow.write("titl = " + titl);
// UltraEdit.outputWindow.write("t = " + t);
My question is two-fold:
<TITLE></TITLE>
contents from an HTML file and rename the files?EDIT:
I have been able to get the script to work as desired by removing the UltraEdit.activeDocument.unicodeToASCII();
line and changing the file extension to .html
. My only issue now is that while this script works on single open files, it does not batch process the directory.
Upvotes: 3
Views: 2303
Reputation: 33
After much searching and trial and error on the scripting side, I ran across a fantastic program for Windows that will do the renaming via TITLE tags: Flexible Renamer 8.3. The author's website is http://hp.vector.co.jp/authors/VA014830/english/FlexRena/, and it manages to handle every bit of what I needed. Many thanks to @coreyward and @Yuji for their fantastic advice on the scripting end of things.
Upvotes: 0
Reputation: 80128
You can use just about any "scriptable" language to do something like this pretty quickly. Ruby is my favorite:
require 'fileutils'
dir = "/your/directory"
files = Dir["#{dir}/*.html"]
files.each do |file|
html = IO.read file
title = $1 if html.match /<title>([^<]+)<\/title>/i
FileUtils.mv file "#{dir}/#{title}.html"
puts "Renamed #{file} to #{title}.html."
end
Obviously if your UltraEdit script worked for you this might be obtuse, but for anybody running a different env, hopefully this is useful.
Upvotes: 2
Reputation: 118518
Does this not work out of the box?
I don't know anything about UltraEdit, but as far as a regex engine is concerned, if it can parse <title>(.*)</title>
out of an XML document, it can do the exact same for HTML.
Just modify the final file title to .html
instead of .xml
saveTitle = t[1]+".html";
Assuming you can get that script to work as it's intended (point being I don't know UltraEdit), I'm pretty confident that same process will work for HTML.
Upvotes: 1
Reputation: 5613
XML and HTML are both plain text, and that script is simply running a regular expression on the text to extract the title tags, which are the same in both; the only thing you need to do is change this line:
saveTitle = t[1]+".xml";
to this:
saveTitle = t[1]+".html";
Upvotes: 1