JBurace

Reputation: 5633

Dynamically group poorly-structured HTML, that has no IDs?

There is a very old website I use, and the data is not displayed in a friendly fashion. I would like to write a userscript (JavaScript/jQuery) that improves the readability of this site. The content looks like this (the HTML comments are my own, to help show the structure):

<font size="3" face="Courier">
  <br>
  <!-- Begin entry 1 -->
  Name1 (Location1) - Date1:
  <br>
  Text1
  <br>
  Text1 (continued)
  <br>
  Text1 (continued)
  <br>
  <br>
  <!-- Begin entry 2 -->
  Name2 (Location2) - Date2:
  <br>
  Text2
  <br>
  Text2 (continued)
  <br>
  <br>
  Text2 (continued)
  <br>
  Text2 (continued)
  <br>
  <br>
  <!-- Begin entry 3 -->
  Name3 (Location3) - Date3:
  <br>
  Text3
  <br>
  Text3 (continued)
  <br>
  Text3 (continued)
  <br>
  <br>
  <br>
  Text3 (continued)
  <br>
  Text3 (continued)
  <!-- Below is Text3, but a user copied Entry1 here --> 
  Name1 (Location1) - Date1: <!-- text3 -->
  <br> <!-- text3 -->
  Text1 <!-- text3 -->
  <br> <!-- text3 -->
  Text1 (continued) <!-- text3 -->
  <br> <!-- text3 -->
  Text1 (continued) <!-- text3 -->
  <br>
  <br>
  <!-- Begin entry 4 -->
  Name4 ...
  ......
</font>

As you can see, two <br> tags always come before the next "entry" (name, location, date), but since the text is free text it can also contain any number of <br> tags, including two or more in a row. Another issue is that the text may itself contain a Name (Location) - Date line pasted from another entry.

So if I wanted to write a script for Google Chrome that added a button to collapse (or expand, if already collapsed) each entry, is that possible? The issue I'm having is that since no unique element starts or ends an entry, I'm not sure how to begin.

The general concept is to loop through each "entry" (the header, being name/location/date, plus the text that follows it up until the next header) and make each "entry" collapsible (the way Reddit comments are collapsible).

Or, for a simpler concept, what if I wanted to mark every other entry with red font? Then all of entry1 would be black font, entry2 would be red font, entry3 would be black font, entry4 would be red font, and so on.

Upvotes: 0

Views: 282

Answers (4)

Brock Adams

Reputation: 93533

For this kind of thing, parse the entries in a state-machine loop.

The code below is written to:

  1. Group the HTML as specified in the question.
  2. Provide click control to expand/contract the groupings.
  3. Collapse entries to start -- for better initial overview.

See a demo of it at jsFiddle.

UPDATE:

The question's HTML did not match the actual page structure. Updated the script below to account for that, and also added the CSS to the script-code:

var containerNode       = document.querySelector ("p font xpre");
var contentNodes        = containerNode.childNodes;
var tempContainer       = document.createElement ("div");
var groupingContainer   = null;
var hidableDiv          = null;
var bInEntry            = false;
var bPrevNodeWasBr      = false;

for (var J = 0, numKids = contentNodes.length;  J < numKids;  ++J) {
    var node            = contentNodes[J];

    //--- Is the node an entry start?
    if (    node.nodeType === Node.TEXT_NODE
        &&  bPrevNodeWasBr
        &&  /^\s*\w.*\s\(.+?\)\s+-\s+\w.+?:\s*$/.test (node.textContent)
    ) {
        //--- End the previous grouping, if any and start a new one.
        if (bInEntry) {
            groupingContainer.appendChild (hidableDiv);
            tempContainer.appendChild (groupingContainer);
        }
        else
            bInEntry        = true;

        groupingContainer   = document.createElement ("div");
        groupingContainer.className = "groupingDiv";

        /*--- Put the entry header in a special <span> to allow for
            expand/contract functionality.
        */
        var controlSpan         = document.createElement ("span");
        controlSpan.className   = "expandCollapse";
        controlSpan.textContent = node.textContent;
        groupingContainer.appendChild (controlSpan);

        //--- Since we can't style text nodes, put everything in this sub-wrapper.
        hidableDiv          = document.createElement ("div");
    }
    else if (bInEntry) {
        //--- Add a deep copy of the current node (children included) to the latest grouping container.
        hidableDiv.appendChild (node.cloneNode(true) );
    }

    if (    node.nodeType === Node.ELEMENT_NODE
        &&  node.nodeName === "BR"
    ) {
        bPrevNodeWasBr  = true;
    }
    else
        bPrevNodeWasBr  = false;
}

//--- Finish up the last entry, if any.
if (bInEntry) {
    groupingContainer.appendChild (hidableDiv);
    tempContainer.appendChild (groupingContainer);
}

/*--- If we have done any grouping, replace the original container contents
    with our collection of grouped nodes. (If no entry was found, leave the
    page untouched.)
*/
if (tempContainer.hasChildNodes() ) {
    while (containerNode.hasChildNodes() ) {
        containerNode.removeChild (containerNode.firstChild);
    }

    while (tempContainer.hasChildNodes() ) {
        containerNode.appendChild (tempContainer.firstChild);
    }
}

//--- Initially collapse all sections and make the control spans clickable.
var entryGroups         = document.querySelectorAll ("div.groupingDiv span.expandCollapse");
for (var J = entryGroups.length - 1;  J >= 0;  --J) {
    ExpandCollapse (entryGroups[J]);

    entryGroups[J].addEventListener ("click", ExpandCollapse, false);
}


//--- Add the CSS styles that make this work well...
addStyleSheet ( "                                                   \
    div.groupingDiv {                                               \
        border:         1px solid blue;                             \
        margin:         1ex;                                        \
        padding:        1ex;                                        \
    }                                                               \
    span.expandCollapse {                                           \
        background:     lime;                                       \
        cursor:         pointer;                                    \
    }                                                               \
    div.groupingDiv     span.expandCollapse:before {                \
        content:        '-';                                        \
        background:     white;                                      \
        font-weight:    bolder;                                     \
        font-size:      150%;                                       \
        padding:        0 1ex 0 0;                                  \
    }                                                               \
    div.groupingDiv     span.expandCollapse.collapsed:before {      \
        content:        '+';                                        \
    }                                                               \
" );


//--- Functions used...
function ExpandCollapse (eventOrNode) {
    var controlSpan;
    if (typeof eventOrNode.target == 'undefined')
        controlSpan     = eventOrNode;
    else
        controlSpan     = eventOrNode.target;

    //--- Is it currently expanded or contracted?
    var bHidden;
    if (/\bcollapsed\b/.test (controlSpan.className) ) {
        bHidden         = true;
        controlSpan.className = controlSpan.className.replace (/\s*collapsed\s*/, "");
    }
    else {
        bHidden         = false;
        controlSpan.className += " collapsed";
    }

    //--- Now expand or collapse the matching group.
    var hidableDiv      = controlSpan.parentNode.children[1];
    hidableDiv.style.display    = bHidden ? "" : "none";
}


function addStyleSheet (text) {
    var D                   = document;
    var styleNode           = D.createElement ('style');
    styleNode.type          = "text/css";
    styleNode.textContent   = text;

    var targ = D.getElementsByTagName ('head')[0] || D.body || D.documentElement;
    //--- Don't error check here. if DOM not available, should throw error.
    targ.appendChild (styleNode);
}
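
Stripped of the DOM handling, the state-machine idea in the script above can be sketched on a plain array. This is only an illustration under stated assumptions: the token array and the `groupEntries` helper are hypothetical stand-ins (each token is either the string "<br>" or one line of text), and only the header regex is taken from the script.

```javascript
// Hypothetical, DOM-free stand-in: each token is either "<br>" or a line of text.
var HEADER_RE = /^\s*\w.*\s\(.+?\)\s+-\s+\w.+?:\s*$/;   // same header test as the script above

function groupEntries(tokens) {
    var entries    = [];
    var current    = null;      // null means "not inside an entry yet"
    var prevWasBr  = false;

    tokens.forEach(function (token) {
        var isBr = (token === "<br>");

        if (!isBr && prevWasBr && HEADER_RE.test(token)) {
            // State change: a header line opens a new entry (and so ends the old one).
            current = { header: token.trim(), body: [] };
            entries.push(current);
        }
        else if (current) {
            // Still inside an entry: everything belongs to its body.
            current.body.push(token);
        }
        // Tokens seen before the first header are skipped.
        prevWasBr = isBr;
    });
    return entries;
}

var sample = [
    "<br>", "Name1 (Location1) - Date1:", "<br>", "Text1", "<br>", "<br>",
    "Name2 (Location2) - Date2:", "<br>", "Text2", "<br>", "<br>", "Text2 (continued)"
];
var grouped = groupEntries(sample);
console.log(grouped.length);        // number of entries found
console.log(grouped[0].body);       // tokens collected for the first entry
```

Like the script, this sketch only requires a single preceding <br>, so a header line pasted into the free text right after a <br> would start a new group.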

If nested/quoted entries are to be wrapped separately, you will also need to recurse. For nested/quoted entries, open a new question after this one is answered.

Note: The new sample HTML has multiple pairs of <html> tags and 2 sets of entries! This is probably a cut-and-paste error, but if it is not, open a new question if help is needed for the easy mod to process multiple sets.

Upvotes: 1

jfriend00

Reputation: 707716

You would have to figure out how to search the DOM to find the elements you want. For example, you can find things by tag name and then examine the context around a given tag to see if it's what you are looking for.

If you provide more info on what exactly you're trying to find, we could likely help with more specific code.

For example, document.getElementsByTagName("br") finds all <br> tags in the document. You could examine each one to find double <br> tags if that's what you're trying to find, or if you're looking for some specific text before or after double <br> tags, you could look for that too. As I said in my comment, you need to be more specific about what pattern you're actually looking for before more specific code can be suggested.

For example, here's how you would search for a particular text pattern that follows a <br> tag in your document:

var items = document.getElementsByTagName("br");
// modify this regex to suit what you're trying to match
var re = /\w+\s\(\w+\)/;
for (var i = 0, len = items.length; i < len; i++) {
    var node = items[i];
    while ((node = node.nextSibling) && node.nodeType == 3) {
        if (re.test(node.nodeValue)) {
            // add a marker test node (just for test purposes)
            var span = document.createElement("span");
            span.className = "marker";
            span.innerHTML = "X";
            node.parentNode.insertBefore(span, node.nextSibling);
        }            
    }        
}

You can modify the regex to be whatever you want the search to be looking for.

You can see a working demo here: http://jsfiddle.net/jfriend00/s9VMn/


OK, here's one more shot at guessing what pattern you're looking for using a regular expression. This looks for two successive <br> tags followed by text that matches the pattern. It then wraps that text in a span so it can be styled according to even or odd.

function getTextAfter(node) {
    // collect text from successive text nodes
    var txt = "";
    while ((node = node.nextSibling) && node.nodeType == 3) {
           txt += node.nodeValue;
    }
    return(txt);    
}

function wrapTextInSpan(preNode, cls) {
    // collect successive text nodes
    // into a span tag
    var node = preNode, item;
    var span = document.createElement("span");
    span.className = cls;
    node = node.nextSibling;
    while (node && node.nodeType == 3) {
        item = node;
        node = node.nextSibling;
        span.appendChild(item);
    }
    preNode.parentNode.insertBefore(span, preNode.nextSibling);
    return(span);
}

// find double br tags
var items = document.getElementsByTagName("br");
var cnt = 1;
var re = /\w+\s+\([^)]+\)\s+-\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+,\s+\d+\d+/i;
for (var i = 0, len = items.length; i < len; i++) {
    var node = items[i];
    // collect text from successive text nodes
    var txt = "";
    while ((node = node.nextSibling) && node.nodeType == 3) {
           txt += node.nodeValue;
    }
    // if no text, check for successive BR tags
    if (txt.replace(/\n|\s/g, "") == "") {
        if (i + 1 < len && node === items[i + 1]) {
            // found a double BR tag
            // get the text after it
            txt = getTextAfter(node);
            if (re.test(txt)) {
                wrapTextInSpan(node, "marker" + (cnt % 2 ? "Odd" : "Even"));
                ++cnt;
            }
            ++i;
        }
    }
}

Working demo here: http://jsfiddle.net/jfriend00/ewApy/
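
To see what the two patterns in this answer accept, they can be run against a few strings outside the page. This is a quick check rather than part of the solution; the sample strings are made up, and the "Mar 5, 2012" date format is inferred only from the regex itself.

```javascript
var loose = /\w+\s\(\w+\)/;     // first pattern in this answer
var dated = /\w+\s+\([^)]+\)\s+-\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+,\s+\d+\d+/i;

var samples = [
    "John (Boston) - Mar 5, 2012:",         // hypothetical header with a real date
    "Name1 (Location1) - Date1:",           // placeholder header from the question
    "see the note (draft) for details"      // ordinary free text
];

// Record which pattern matches which sample.
var results = samples.map(function (text) {
    return { text: text, loose: loose.test(text), dated: dated.test(text) };
});
console.log(results);
```

The loose pattern matches all three strings, including the free text, while the date-based pattern matches only the first. The placeholder headers in the question therefore need real dates before the stricter pattern will find them.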


Here's one more version that actually inserts an expand/collapse target and does the expand/collapse of the sections. This could be so easy with the right HTML and with a nice library like jQuery, but without either it's a lot more code:

function getTextAfter(node) {
    // collect text from successive text nodes
    var txt = "";
    while ((node = node.nextSibling) && node.nodeType == 3) {
           txt += node.nodeValue;
    }
    return(txt);    
}

function wrapTextInSpan(preNode, cls) {
    // collect successive text nodes
    // into a span tag
    var node = preNode, item;
    var span = document.createElement("span");
    span.className = cls;
    node = node.nextSibling;
    while (node && node.nodeType == 3) {
        item = node;
        node = node.nextSibling;
        span.appendChild(item);
    }
    preNode.parentNode.insertBefore(span, preNode.nextSibling);
    return(span);
}

function wrapBetweenInSpan(preNode, postNode, cls) {
    var node = preNode, item;
    var span = document.createElement("span");
    span.className = cls;
    node = node.nextSibling;
    if (node && node.nodeType == 1 && node.tagName == "BR") {
        preNode = node;
        node = node.nextSibling;
    }
    while (node && node != postNode) {
        item = node;
        node = node.nextSibling;
        span.appendChild(item);
    }
    preNode.parentNode.insertBefore(span, preNode.nextSibling);
    return(span);
}

function toggleClass(el, cls) {
    var str = " " + el.className + " ";
    if (str.indexOf(" " + cls + " ") >= 0) {
        // remove the whole class name only, then collapse and trim whitespace
        str = str.replace(" " + cls + " ", " ").replace(/\s+/g, " ").replace(/^\s+|\s+$/g, "");
        el.className = str;
    } else {
        el.className = el.className + " " + cls;
    }
}

function hasClass(el, cls) {
    var str = " " + el.className + " ";
    return(str.indexOf(" " + cls + " ") >= 0);    
}

function addButton(target) {
    var span = document.createElement("span");
    span.className = "expandoButton";
    span.innerHTML = "+++";
    span.onclick = function(e) {
        var expando = this;
        do {
            expando = expando.nextSibling;
        } while (expando && !hasClass(expando, "markerContents"));
        toggleClass(expando, "notshown");
    };
    target.parentNode.insertBefore(span, target.nextSibling);
}

// find double br tags
var items = document.getElementsByTagName("br");
var cnt = 1;
var spans = [];
var re = /\w+\s+\([^)]+\)\s+-\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d+,\s+\d+\d+/i;
for (var i = 0, len = items.length; i < len; i++) {
    var node = items[i];
    // collect text from successive text nodes
    var txt = "";
    while ((node = node.nextSibling) && node.nodeType == 3) {
           txt += node.nodeValue;
    }
    // if no text, check for successive BR tags
    if (txt.replace(/\n|\s/g, "") == "") {
        if (i + 1 < len && node === items[i + 1]) {
            // found a double BR tag
            // get the text after it
            txt = getTextAfter(node);
            if (re.test(txt)) {
                var span = wrapTextInSpan(node, "marker marker" + (cnt % 2 ? "Odd" : "Even"));
                spans.push(span);
                ++cnt;
            }
            ++i;
        }
    }
}

// now wrap the contents of each marker
for (i = 0, len = spans.length; i < len; i++) {
    wrapBetweenInSpan(spans[i], spans[i+1], "markerContents shown");
    addButton(spans[i]);
}

Working demo of this version: http://jsfiddle.net/jfriend00/cPbqC/
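
The class toggling used for expand/collapse can also be checked on its own as a pure string function. The `toggleClassName` helper below is hypothetical (it is not in the code above) and is only a sketch of the same idea. In current browsers, `el.classList.toggle(cls)` does this directly on an element.

```javascript
// Hypothetical pure helper: takes a className string, returns the toggled string.
function toggleClassName(className, cls) {
    var names = className.split(/\s+/).filter(Boolean);    // drop empty pieces
    var idx   = names.indexOf(cls);
    if (idx >= 0) {
        names.splice(idx, 1);       // class present: remove it
    } else {
        names.push(cls);            // class absent: add it
    }
    return names.join(" ");
}

var hiddenNow = toggleClassName("markerContents shown", "notshown");
var shownNow  = toggleClassName(hiddenNow, "notshown");
console.log(hiddenNow);
console.log(shownNow);
```

On an element this would be used as `el.className = toggleClassName(el.className, "notshown");`.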

Upvotes: 2

nnnnnn

Reputation: 150080

There are a number of methods that let you select elements without knowing the id, e.g. document.getElementsByTagName().

UPDATE: I don't see any way to distinguish between two <br> elements in a row that are an end-of-entry marker and two <br> elements in a row that are simply part of a particular entry. From your examples, the "text" entries can contain anything that might have been in the name/location/date line. So simplifying it slightly and taking every double-br as an end of entry you can do something like this:

window.onload = function() {
    var fontTags = document.getElementsByTagName("font"),
        i, j = 0;

    for (i = 0; i < fontTags.length; i++)
        fontTags[i].innerHTML = '<div class="entry odd">' +
            fontTags[i].innerHTML.replace(/<br>\s*?<br>/g, function() {
            return '</div><div class="entry ' + (j++ %2===0?'even':'odd') + '">';
        }) + '</div>';
};

This assumes all font elements contain data to be processed and uses .replace() to find the double-br occurrences and put wrapper divs around each entry instead. I've given every div a class "entry", and then alternate ones the classes "even" and "odd" so that you can then apply a style like this:

div.odd { color : red; }

As shown in this demo: http://jsfiddle.net/C4h7s/
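
The same replacement can be tried on a plain string to see what markup it produces. This is a sketch under assumptions: `wrapEntries` is a hypothetical helper name, the sample string is made up, and the counter is local to each call instead of shared across all font elements.

```javascript
// Hypothetical helper: wrap each double-<br>-separated chunk in an "entry" div.
function wrapEntries(html) {
    var j = 0;      // counts the separators replaced so far
    return '<div class="entry odd">' +
        html.replace(/<br>\s*<br>/g, function () {
            return '</div><div class="entry ' + (j++ % 2 === 0 ? 'even' : 'odd') + '">';
        }) + '</div>';
}

var sampleHtml = "Name1 (Location1) - Date1:<br>Text1<br><br>" +
                 "Name2 (Location2) - Date2:<br>Text2<br>\n  <br>" +
                 "Name3 (Location3) - Date3:<br>Text3";
var wrapped = wrapEntries(sampleHtml);
console.log(wrapped);
```

Each double <br> disappears from the output and becomes a div boundary, so any double <br> inside an entry's free text also splits that entry, as noted above.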

Obviously you could use inline styles to set the colours if you can't add classes to the stylesheet.

That's the closest I could get to your every-other-entry-is-red requirement. I'm not actually using the "entry" class for anything in that example, but at the time it seemed like it might be useful later, e.g., in this really clunky implementation of the click to toggle idea: http://jsfiddle.net/C4h7s/1/

(I don't really have time or motivation to tidy those demos up, but at least they should give you some ideas of one way to proceed. Or one way not to proceed, depending on how silly you think my code is.)

Upvotes: 0

Bergi

Reputation: 664970

If you need to get the text contents between the <br />s:

  1. select the <font> element, e.g. with .getElementsByTagName()
  2. get its childNodes and loop over them:
    • If its node type is 1, it should be one of your <br /> elements; check with .nodeName (otherwise you'd need to extend your loop over the element's children)
    • If its node type is 3, it is a text node. Get the text value and match it to your content scheme

You should then be able to build a more suitable DOM from that. You could even reuse the text nodes and just wrap them in proper tags.
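
The loop described in the steps above might look roughly like this. It is a minimal sketch, not a full solution: `classifyNodes` and the header regex are assumptions for illustration, and the mock objects stand in for the font element's childNodes so that the logic can be run outside a browser.

```javascript
// Assumed content scheme: "Name (Location) - Date:" on a line of its own.
var HEADER_RE = /^\w.*\s\(.+?\)\s+-\s+.+:$/;

// Accepts anything array-like whose items have nodeType / nodeName / nodeValue,
// e.g. document.getElementsByTagName("font")[0].childNodes in a browser.
function classifyNodes(childNodes) {
    var result = [];
    for (var i = 0; i < childNodes.length; i++) {
        var node = childNodes[i];
        if (node.nodeType === 1) {                      // element node
            result.push(node.nodeName === "BR" ? "break" : "other-element");
        }
        else if (node.nodeType === 3) {                 // text node
            var text = node.nodeValue.trim();
            if (text === "") continue;                  // skip whitespace-only text
            result.push(HEADER_RE.test(text) ? "header" : "text");
        }
        // Other node types (such as comments, nodeType 8) are ignored.
    }
    return result;
}

// Mock nodes standing in for real DOM nodes.
var mockNodes = [
    { nodeType: 1, nodeName: "BR" },
    { nodeType: 3, nodeName: "#text", nodeValue: "\n  Name1 (Location1) - Date1:\n  " },
    { nodeType: 1, nodeName: "BR" },
    { nodeType: 3, nodeName: "#text", nodeValue: "\n  Text1\n  " },
    { nodeType: 8, nodeName: "#comment", nodeValue: " Begin entry 2 " }
];
var kinds = classifyNodes(mockNodes);
console.log(kinds);
```

From a classified list like this, each "header" item marks where a new wrapper element would begin.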

Upvotes: 0
