Read all text between non-sibling HTML tags

Question

I have an HTML page (generated by Drupal) that, near the top of the page at a place of my choosing, has

<span class="marker-start"></span>

and near the end, at a place of my choosing, has

<span class="marker-end"></span>
In between those is some HTML written by users that will probably but not definitely be well formed.

The user can add additional tags like the above, so as to exclude content, e.g.:

<span class="marker-end"></span>
This HTML here is excluded
<span class="marker-start"></span>

Note that the exclusion block begins with a 'marker-end', as that matches the 'marker-start' at the beginning of the page to form a pair, and similarly the exclusion block ends with a 'marker-start' to pair with the 'marker-end' at the end of the document (or the start of another exclusion block).

While theoretically that exclusion block will be well formed, I will say again: written by users. Tags may legitimately be opened or closed in an uneven way (for example, a closing </div> may come AFTER the marker-start), and so on. Basically, there is no guarantee that the markers will be siblings.

The user can add multiple excluded spans within the document.

I need a way to read the text (NOT the HTML) between each pair of 'marker-start' and 'marker-end', and to concatenate that text together (which will leave out any exclusion blocks). The markers may not (in fact almost certainly will not) be siblings in a balanced position, i.e. there will probably be tags that are opened but not closed, or vice versa, between them.

I have tried the methods suggested in "How to select all content between two tags in jQuery" and "Get text between two elements JQUERY" and hit problems with all of them.

In general, I have really struggled to have jQuery produce any useful results.

Can anyone suggest the simplest method to achieve this? I do have two solutions which I will outline in an answer for others to see but neither is perfect.

iCollect.it Ltd · Accepted Answer

You could try walking the entire DOM recursively, keeping or discarding each text node based on the most recent start or end marker encountered.

As a simple example (if I understand your exclusion logic correctly):

JSFiddle: http://jsfiddle.net/fdductdg/2/

// Visit a node and then all of its descendants, in document order
function walkDOM(node, func) {
    func(node);
    node = node.firstChild;
    while (node) {
        walkDOM(node, func);
        node = node.nextSibling;
    }
}

// True while we are between a marker-start and the next marker-end
var inMarker = false;

walkDOM(document.body, function (node) {
    var $node = $(node);
    if ($node.is('span')) {
        if ($node.hasClass('marker-end')) {
            inMarker = false;
            console.log('end marker');
        } else if ($node.hasClass('marker-start')) {
            inMarker = true;
            console.log('start marker');
        }
    }
    if (node.nodeType === Node.TEXT_NODE && !inMarker) {
        // Outside a start/end pair: blank the text so only the wanted text remains
        node.textContent = '';
    }
});

Update:

As you also wish to retain the original text, you can either collect it in a variable (as you appear to have done in your comment) or wrap any matching text nodes in appropriate elements (e.g. a span with an appropriate class) so that the excluded text can simply be shown or hidden with CSS, without destroying the content.
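
A minimal sketch of the first option (collecting the text in a variable rather than blanking it) might look like the following. The function name collectMarkedText is made up for illustration, and the sketch assumes that any element carrying the marker-start or marker-end class counts as a marker. It only reads nodeType, className, nodeValue, firstChild and nextSibling, so it leaves the page untouched.

```javascript
// Hypothetical helper (not from the fiddle above): returns the concatenated
// text found between each marker-start / marker-end pair, in document order.
function collectMarkedText(root) {
    var inMarker = false;   // true between a marker-start and the next marker-end
    var parts = [];         // text fragments that fall inside a pair

    // Assumption: markers are identified by class name only
    function hasClass(node, name) {
        return typeof node.className === 'string' &&
            (' ' + node.className + ' ').indexOf(' ' + name + ' ') !== -1;
    }

    function walk(node) {
        if (node.nodeType === 1) {                      // element node
            if (hasClass(node, 'marker-start')) {
                inMarker = true;
            } else if (hasClass(node, 'marker-end')) {
                inMarker = false;
            }
        } else if (node.nodeType === 3 && inMarker) {   // text node inside a pair
            parts.push(node.nodeValue);
        }
        // Recurse into children; markers need not be siblings
        for (var child = node.firstChild; child; child = child.nextSibling) {
            walk(child);
        }
    }

    walk(root);
    return parts.join('');
}
```

On a real page this would be called as collectMarkedText(document.body). Because the state is a single flag updated during one document-order walk, it does not matter how deeply either marker is nested or whether the surrounding tags are balanced.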
