JavaScript regex replace does not work as expected

Question

I'm having trouble cleaning up some HTML with a JavaScript regex replace. The task is to get TV listings for my XBMC from a local source: http://tv.dir.bg/tv_search.php?step=1&all=1 (in Bulgarian). I'm using a scraper to get the data, http://code.google.com/p/epgss/ (credits to Ivan Markov, http://code.google.com/u/113542276020703315321/). Unfortunately, the TV listings page has changed since the tool was last updated, so I'm trying to get it working again. The problem is that parsing the HTML as XML fails. I'm now trying to clean the HTML a bit by regex-replacing the head and script tags, but the replacement has no effect. Here's my replacer:

function regexReplace(pattern, value, replacer)
{
    var regEx = new RegExp(pattern, "g");
    var result = value.replaceAll(regEx, replacer);
    if (result == null)
        return null;
    return result;
}

And here's my call:

var htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251");  
log("Content grabbed (schedule for next 7 days)");  
log(url);  
var htmlString = regexReplace("([\s\S]*?)|", htmlStringCluttered, "");

The getHTML function comes from the original source, with my minor modification of setting the User-Agent header. It is built on this:

    public static java.io.Reader open(URL url, String charset) throws UnsupportedEncodingException, IOException
    {
        URLConnection con = url.openConnection();
        con.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.38 Safari/532.0");
        con.setAllowUserInteraction(false);
        con.setReadTimeout(60*1000/*ms*/);

        con.connect();

        if(charset == null && con instanceof HttpURLConnection) {
            HttpURLConnection httpCon = (HttpURLConnection)con;
            charset = httpCon.getContentEncoding();
        }

        if(charset == null)
            charset = "UTF-8";

        return new InputStreamReader(con.getInputStream(), charset);
    }

The result of regexReplace is identical to the original string. Since the XML cannot be parsed, the script cannot read the elements. Any ideas?

pete · Accepted Answer

UPDATE:

To convert this to an XMLDocument, you can do the following:

var parseXml,
    xml,
    htmlStringCluttered = HTML.getHTML(new URL(url), "WINDOWS-1251"),
    htmlString = '';

if (typeof window.DOMParser != "undefined") {
    parseXml = function (xmlStr) {
        return (new window.DOMParser()).parseFromString(xmlStr, "text/xml");
    };
} else if (typeof window.ActiveXObject != "undefined" && new window.ActiveXObject("Microsoft.XMLDOM")) {
    parseXml = function (xmlStr) {
        var xmlDoc = new window.ActiveXObject("Microsoft.XMLDOM");
        xmlDoc.async = "false";
        xmlDoc.loadXML(xmlStr);
        return xmlDoc;
    };
} else {
    throw new Error("No XML parser found");
}

console.log("Content grabbed (schedule for next 7 days)");
console.log(url);

//eliminate the '' section
htmlString = htmlStringCluttered.replace(/()/ig, '')

//eliminate any remaining '
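
For reference, here is a minimal sketch of the head/script cleanup using regex literals. The function name stripHeadAndScripts and the sample markup are hypothetical, not part of the original scraper, and the String(html) conversion assumes the script may be handed a Java string by a Java-hosted engine such as Rhino. It also illustrates one likely reason the string-based call in the question does nothing: inside a JavaScript string literal, "\s" is read as a plain "s", so the backslashes have to be doubled before the pattern reaches RegExp. Regex-based tag stripping is fragile on irregular markup, so treat this as a pre-cleaning step rather than a parser.

```javascript
// Hypothetical helper (not from the original scraper): removes <head> and
// <script> elements from an HTML string before further parsing.
function stripHeadAndScripts(html) {
    // Regex literals keep \s and \S intact. [\s\S] matches any character,
    // including newlines, and *? stops at the first closing tag.
    return String(html)
        .replace(/<head\b[^>]*>[\s\S]*?<\/head\s*>/gi, "")
        .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

// In a string literal the backslashes must be doubled; otherwise "\s" is
// turned into "s" before the RegExp constructor ever sees it.
var unescaped = new RegExp("[\s\S]*?", "g").source;
var escaped = new RegExp("[\\s\\S]*?", "g").source;

// Made-up sample markup for illustration only.
var sample = "<html><head><title>t</title></head>" +
    "<body><script type=\"text/javascript\">var a = 1;</script>" +
    "<p>News</p></body></html>";
var cleaned = stripHeadAndScripts(sample);
```

With the sample above, cleaned keeps only the html, body and p elements, while unescaped ends up as a character class of the letters s and S instead of "any character".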
