jsutyla

Reputation: 33

Using JavaScript to rename multiple HTML files using the <TITLE></TITLE> in each file

I have used HTTRACK to download Federal regulations from a government website, and the resulting HTML files are not intuitively named. Each file has a <TITLE></TITLE> tag pair whose contents would serve nicely as the file name, in a fashion that lends itself to ebook creation. I want to turn these regulations into an ebook for my Kindle so that I can have them readily available for reference, rather than having to carry volumes of books with me everywhere.

My preferred text/hex editor, UltraEdit Professional 15.20.0.1026, has scripting commands enabled through an embedded JavaScript engine. In researching possible solutions to my problem, I found xmlTitleSave on the IDM UltraEdit website.

// ----------------------------------------------------------------------------
// Script Name: xmlTitleSave.js
// Creation Date: 2008-06-09
// Last Modified: 
// Copyright: none
// Purpose: find the <title> value in an XML document, then saves the file as the 
// title.xml in a user-specified directory
// ----------------------------------------------------------------------------

//Some variables we need
var regex = "<title>(.*)</title>" //Perl regular expression to find title string
var file_path = UltraEdit.getString("Path to save file at? !! MUST PRE EXIST !!",1);

// Start at the beginning of the file
UltraEdit.activeDocument.top();

UltraEdit.activeDocument.unicodeToASCII();

// Turn on regular expressions
UltraEdit.activeDocument.findReplace.regExp = true;

// Find it
UltraEdit.activeDocument.findReplace.find(regex);

// Load it into a selection
var titl = UltraEdit.activeDocument.selection;

// JavaScript function 'match' will match the regex within the JavaScript engine 
// so we can extract the actual title via array
var t = titl.match(regex);

// 't' is an array of the match from 'titl' based on the var 'regex'
// the 2nd value of the array gives us what we need... then append '.xml'
var saveTitle = t[1]+".xml";

UltraEdit.saveAs(file_path + saveTitle);

// Uncomment for debugging
// UltraEdit.outputWindow.write("titl = " + titl);
// UltraEdit.outputWindow.write("t = " + t);

My question is two-fold:

  1. Can this JavaScript be modified to extract the <TITLE></TITLE> contents from an HTML file and rename the files?
  2. If the JavaScript cannot be modified easily, is there a script/program/black magic/animal sacrifice that can accomplish the same thing?

EDIT: I have been able to get the script to work as desired by removing the UltraEdit.activeDocument.unicodeToASCII(); line and changing the file extension to .html. My only issue now is that while this script works on single open files, it does not batch process the directory.

Upvotes: 3

Views: 2303

Answers (4)

jsutyla

Reputation: 33

After much searching and trial and error on the scripting side, I ran across a fantastic program for Windows that will do the renaming via TITLE tags: Flexible Renamer 8.3. The author's website is http://hp.vector.co.jp/authors/VA014830/english/FlexRena/, and it manages to handle every bit of what I needed. Many thanks to @coreyward and @Yuji for their fantastic advice on the scripting end of things.

Upvotes: 0

coreyward

Reputation: 80128

You can use just about any "scriptable" language to do something like this pretty quickly. Ruby is my favorite:

require 'fileutils'

dir = "/your/directory"
files = Dir["#{dir}/*.html"]

files.each do |file|
  html = File.read(file)
  # Skip files with no usable <title> instead of renaming them to ".html"
  next unless html =~ /<title>([^<]+)<\/title>/i
  title = $1.strip
  FileUtils.mv(file, "#{dir}/#{title}.html")
  puts "Renamed #{file} to #{title}.html."
end

Obviously, if your UltraEdit script worked for you, this might be moot, but for anybody running a different environment, hopefully this is useful.

Upvotes: 2

Does this not work out of the box?

I don't know anything about UltraEdit, but as far as a regex engine is concerned, if it can parse <title>(.*)</title> out of an XML document, it can do the exact same for HTML.

Just change the final file extension to .html instead of .xml:

saveTitle = t[1]+".html";

Assuming you can get that script to work as it's intended (point being I don't know UltraEdit), I'm pretty confident that same process will work for HTML.
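One caveat on the regex itself, shown here in plain JavaScript rather than through UltraEdit's find dialog: String.prototype.match is case-sensitive by default and "." does not cross line breaks, so upper-case <TITLE> tags or a title wrapped onto several lines would not match. The looser pattern below is only an illustrative variant, not part of the original script.

```javascript
// The pattern string from the script, as String.prototype.match would compile it.
const pattern = "<title>(.*)</title>";

// Case-sensitive by default, so upper-case tags do not match.
const strict = "<TITLE>Part 91</TITLE>".match(pattern);

// Illustrative variant: "i" flag accepts either case, and [\s\S]*? lets the
// title span several lines while stopping at the first closing tag.
const loose = /<title[^>]*>([\s\S]*?)<\/title>/i;
const upper = "<TITLE>Part 91</TITLE>".match(loose);
const multi = "<title>\n  Part 91\n</title>".match(loose);

console.log(strict);          // null
console.log(upper[1]);        // Part 91
console.log(multi[1].trim()); // Part 91
```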

Upvotes: 1

Herohtar

Reputation: 5613

XML and HTML are both plain text, and that script is simply running a regular expression on the text to extract the title tags, which are the same in both; the only thing you need to do is change this line:

saveTitle = t[1]+".xml";

to this:

saveTitle = t[1]+".html";

Upvotes: 1
