Reputation: 183
I'm having trouble cleaning up some HTML with a JavaScript regex replace. The task is to get a TV listing for my XBMC from a local source. The URL is http://tv.dir.bg/tv_search.php?step=1&all=1 (in Bulgarian). I'm trying to use a scraper to get the data: http://code.google.com/p/epgss/ (credits to Ivan Markov, http://code.google.com/u/113542276020703315321/). Unfortunately, the TV listings page has changed since that tool was last updated, so I'm trying to get it working again.
The problem is that parsing the HTML as XML fails. I'm now trying to clean the HTML up a bit by regex-replacing the head and script tags, but the replacement has no effect. Here's my replacer:
function regexReplace(pattern, value, replacer)
{
    var regEx = new RegExp(pattern, "g");
    var result = value.replaceAll(regEx, replacer);
    if(result == null)
        return null;
    return result;
}
And here's my call:
var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251");
log("Content grabbed (schedule for next 7 days)");
log(url);
var htmlString = regexReplace("<head>([\\s\\S]*?)<\/head>|<script([\\s\\S]*?)<\/script>", htmlStringCluttered, "");
The getHTML function comes from the original source, with my minor modification of setting the User-Agent. Here is its base:
public static java.io.Reader open(URL url, String charset) throws UnsupportedEncodingException, IOException
{
    URLConnection con = url.openConnection();
    con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0");
    con.setAllowUserInteraction(false);
    con.setReadTimeout(60*1000/*ms*/);
    con.connect();
    if(charset == null && con instanceof HttpURLConnection) {
        HttpURLConnection httpCon = (HttpURLConnection)con;
        charset = httpCon.getContentEncoding();
    }
    if(charset == null)
        charset = "UTF-8";
    return new InputStreamReader(con.getInputStream(), charset);
}
The result of regexReplace is exactly the same as the original string, and since the XML cannot be parsed, the script cannot read the elements. Any ideas?
Upvotes: 0
Views: 383
Reputation: 25081
UPDATE:
To convert this to an XMLDocument, you can do the following:
var parseXml,
    xml,
    htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
    htmlString = '';
if (typeof window.DOMParser != "undefined") {
    parseXml = function (xmlStr) {
        return (new window.DOMParser()).parseFromString(xmlStr, "text/xml");
    };
} else if (typeof window.ActiveXObject != "undefined" && new window.ActiveXObject("Microsoft.XMLDOM")) {
    parseXml = function (xmlStr) {
        var xmlDoc = new window.ActiveXObject("Microsoft.XMLDOM");
        //load synchronously (async is a boolean property)
        xmlDoc.async = false;
        xmlDoc.loadXML(xmlStr);
        return xmlDoc;
    };
} else {
    throw new Error("No XML parser found");
}
console.log("Content grabbed (schedule for next 7 days)");
console.log(url);
//eliminate the '<head>' section
htmlString = htmlStringCluttered.replace(/(<head[\s\S]*<\/head>)/ig, '');
//eliminate any remaining '<script>' elements
htmlString = htmlString.replace(/(<script[\s\S]+?<\/script>)/ig, '');
//self-close '<img>' elements
htmlString = htmlString.replace(/<img([^>]*)>/g, '<img$1 />');
//self-close '<br>' elements
htmlString = htmlString.replace(/<br([^>]*)>/g, '<br$1 />');
//self-close '<input>' elements
htmlString = htmlString.replace(/<input([^>]*)>/g, '<input$1 />');
//replace '&nbsp;' entities with an actual non-breaking space
htmlString = htmlString.replace(/&nbsp;/g, String.fromCharCode(160));
//convert to XMLDocument
xml = parseXml(htmlString);
//log new XMLDocument as output
console.log(xml);
//log htmlString as output
console.log(htmlString);
The parseXml function was found at: XML parsing of a variable string in JavaScript
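The clean-up steps above can also be collected into a single function, which makes them easier to test outside the browser. This is only a minimal sketch: the name cleanHtmlForXml is hypothetical, it uses lazy matches and word boundaries (a small variation on the patterns above), and it assumes that no attribute value contains a literal '>' character.

```javascript
// Hypothetical helper (not part of the original scraper): applies the
// clean-up steps in one pass so the result has a better chance of parsing as XML.
function cleanHtmlForXml(html) {
    return String(html)
        // drop the <head> section; the lazy match stops at the first </head>
        .replace(/<head\b[\s\S]*?<\/head>/ig, '')
        // drop any <script> elements
        .replace(/<script\b[\s\S]*?<\/script>/ig, '')
        // self-close <img>, <br> and <input>, whether or not they already end in "/>"
        .replace(/<(img|br|input)\b([^>]*?)\s*\/?>/ig, '<$1$2 />')
        // replace '&nbsp;' entities with an actual non-breaking space
        .replace(/&nbsp;/g, String.fromCharCode(160));
}
```

The result can then be passed to parseXml in place of the step-by-step htmlString.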
You can test this in the browser (I did :) ) simply by defining htmlStringCluttered as:
htmlStringCluttered = document.documentElement.innerHTML;
instead of:
htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
and running it in the console at http://tv.dir.bg/tv_search.php?step=1&all=1
You will also have to either comment out the line:
console.log(url);
or declare url and give it a value.
Original:
Your RegExp needed some work, and it's much simpler (and easier to read) when broken into two replace statements:
var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
    htmlString = '';
console.log("Content grabbed (schedule for next 7 days)");
console.log(url);
//eliminate the '<head>' section
htmlString = htmlStringCluttered.replace(/(<head[\s\S]*<\/head>)/ig, '');
//eliminate any remaining '<script>' elements
htmlString = htmlString.replace(/(<script[\s\S]+?<\/script>)/ig, '');
//log remaining as output
console.log(htmlString);
This was tested in the console by visiting http://tv.dir.bg/tv_search.php?step=1&all=1 and running the following in the console:
console.log(document.documentElement.innerHTML.replace(/(<head[\s\S]*<\/head>)/ig, '').replace(/(<script[\s\S]+?<\/script>)/ig, ''));
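As for why the question's regexReplace returned its input unchanged, one possible explanation is the following. Assuming the script runs on a Java-hosted engine such as Rhino and HTML.getHTML returns a Java string, value.replaceAll would resolve to java.lang.String.replaceAll, which takes the pattern as a string rather than as a RegExp object. Under that assumption, a minimal sketch of the helper would coerce the value to a JavaScript string and use replace instead:

```javascript
// Sketch of the question's helper, rewritten to use String.prototype.replace.
// Assumption: `value` may be a host (e.g. Java) string, so it is coerced first.
function regexReplace(pattern, value, replacer) {
    if (value == null) {
        return null;
    }
    // "g" replaces every match, "i" also matches upper-case tags such as <HEAD>
    var regEx = new RegExp(pattern, "gi");
    // String(...) yields a JavaScript string, so replace() accepts a RegExp
    return String(value).replace(regEx, replacer);
}
```

It is called the same way as before, for example regexReplace("<head\\b[\\s\\S]*?</head>|<script\\b[\\s\\S]*?</script>", htmlStringCluttered, "").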
If this is run on the outerHTML property (as I expect the HTML.getHTML(new URL(url), "WINDOWS-1251") method to return), then the <body> element will be wrapped in:
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
...
</body>
</html>
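If only the contents of the <body> element are needed, the wrapper can be stripped with one more regular expression. This is a rough sketch, with extractBody as a hypothetical name, and it assumes the markup contains a single <body> element:

```javascript
// Hypothetical helper: returns everything between <body ...> and the last
// </body>, or null when the markup has no <body> element.
function extractBody(html) {
    var match = /<body\b[^>]*>([\s\S]*)<\/body>/i.exec(String(html));
    return match ? match[1] : null;
}
```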
Upvotes: 1