LeO

Reputation: 5258

Merging same elements in JSoup

I have an HTML string like this:

<b>test</b><b>er</b>
<span class="ab">continue</span><span> without</span>

I want to collapse adjacent tags that are identical and belong together. For the sample above I want to get

<b>tester</b>

since both elements have the same tag name without any further attribute or style. The span tags, however, should stay as they are, because one of them has a class attribute. I am aware that I can iterate over the tree with Jsoup:

Document doc = Jsoup.parse(input);
for (Element element : doc.select("b")) {
}

But it is not clear to me how to look ahead (I guess with something like nextSibling), and then how to collapse the elements.

Or is there a simple regexp that does the merge?

I can specify the attributes myself. A one-size-fits-all solution for every tag is not required.

Upvotes: 0

Views: 631

Answers (2)

LeO

Reputation: 5258

I tried to update the code from @Krystian G, but my edit was rejected :-/ so I am posting it as a separate answer. His code is an excellent starting point, but it fails if a TextNode with real content appears between the tags. For example,

<span> no class but further</span> (in)valid <span>spanning</span>

would result in

<span> no class but furtherspanning</span> (in)valid

The corrected code therefore looks like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;

public class StackOverflow60704600 {

    public static void main(final String[] args) {
        String test1 = "<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>";
        String test2 = "<b>test</b><b>er<a>123</a></b>";
        String test3 = "<span> no class but further</span>   <span>spanning</span>";
        String test4 = "<span> no class but further</span> (in)valid <span>spanning</span>";
        Document doc = Jsoup.parse(test1);
        mergeSiblings(doc, "b");
        System.out.println(doc);
    }

    private static void mergeSiblings(Document doc, String selector) {
        Elements elements = doc.select(selector);
        for (Element element : elements) {
            Node nextNode = element.nextSibling();
            // if the next node is a TextNode that contains only whitespace ==> we need to
            // preserve the spacing
            boolean addSpace = false;
            if (nextNode instanceof TextNode) {
                if (!((TextNode) nextNode).isBlank()) {
                    // there is real text between the elements ==> do not merge
                    continue;
                }
                addSpace = true;
            }
            // get the next sibling element
            Element nextSibling = element.nextElementSibling();
            // merge only if the next sibling has the same tag name and the same set of
            // attributes
            if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                    && nextSibling.attributes().equals(element.attributes())) {
                // your element has only one child, but let's rewrite all of them if there's more
                while (nextSibling.childNodes().size() > 0) {
                    Node siblingChildNode = nextSibling.childNodes().get(0);
                    if (addSpace) {
                        // we have had some space previously ==> preserve it, but add it only once
                        if (siblingChildNode instanceof TextNode) {
                            TextNode textNode = (TextNode) siblingChildNode;
                            // getWholeText() returns the raw text; toString() would return escaped HTML
                            textNode.text(" " + textNode.getWholeText());
                        } else {
                            element.appendChild(new TextNode(" "));
                        }
                        addSpace = false;
                    }
                    element.appendChild(siblingChildNode);
                }
                // remove because now it doesn't have any children
                nextSibling.remove();
            }
        }
    }
}
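
The rule the code applies to whatever sits between two sibling elements can also be looked at in isolation. This is only a plain-Java sketch without any Jsoup dependency: GapCheckDemo, Gap, classify and mayMerge are hypothetical names I made up for illustration, and the text between the two elements is simply passed in as a String.

```java
public class GapCheckDemo {

    // The three situations that can occur between two sibling elements.
    public enum Gap { NONE, WHITESPACE_ONLY, CONTENT }

    // Classifies the text found between two sibling elements (null or "" means nothing in between).
    public static Gap classify(String textBetween) {
        if (textBetween == null || textBetween.isEmpty()) {
            return Gap.NONE;
        }
        return textBetween.trim().isEmpty() ? Gap.WHITESPACE_ONLY : Gap.CONTENT;
    }

    // Siblings may be merged unless real text separates them.
    public static boolean mayMerge(String textBetween) {
        return classify(textBetween) != Gap.CONTENT;
    }

    public static void main(String[] args) {
        System.out.println(classify(""));             // gap of test1
        System.out.println(classify("   "));          // gap of test3
        System.out.println(classify(" (in)valid "));  // gap of test4
    }
}
```

For WHITESPACE_ONLY the merge still happens, but a single space is carried over into the merged element so the words do not run together.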

Upvotes: 1

Krystian G

Reputation: 2941

My approach would be like this; see the comments in the code:

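import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.Elements;

public class StackOverflow60704600 {

    public static void main(final String[] args) {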
        Document doc = Jsoup.parse("<b>test</b><b>er</b><span class=\"ab\">continue</span><span> without</span>");
        mergeSiblings(doc, "b");
        System.out.println(doc);

    }

    private static void mergeSiblings(Document doc, String selector) {
        Elements elements = doc.select(selector);
        for (Element element : elements) {
            // get the next sibling
            Element nextSibling = element.nextElementSibling();
            // merge only if the next sibling has the same tag name and the same set of attributes
            if (nextSibling != null && nextSibling.tagName().equals(element.tagName())
                    && nextSibling.attributes().equals(element.attributes())) {
                // your element has only one child, but let's rewrite all of them if there's more
                while (nextSibling.childNodes().size() > 0) {
                    Node siblingChildNode = nextSibling.childNodes().get(0);
                    element.appendChild(siblingChildNode);
                }
                // remove because now it doesn't have any children
                nextSibling.remove();
            }
        }
    }
}

output:

<html>
 <head></head>
 <body>
  <b>tester</b>
  <span class="ab">continue</span>
  <span> without</span>
 </body>
</html>

One more note on why I used the loop while (nextSibling.childNodes().size() > 0). It turned out that a for loop or an iterator can't be used here, because appendChild adds the child to the target but also removes it from the source element, so the remaining children are shifted. It may not be visible here, but the problem appears when you try to merge: <b>test</b><b>er<a>123</a></b>
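
To make the shifting effect visible without Jsoup, here is a minimal sketch with a simplified stand-in for a DOM node. SimpleNode and MoveChildrenDemo are hypothetical names, not Jsoup classes; the only assumption is that appending a child detaches it from its previous parent first, as described above.

```java
import java.util.ArrayList;
import java.util.List;

public class MoveChildrenDemo {

    // Simplified stand-in for a DOM node (NOT the Jsoup API).
    static final class SimpleNode {
        SimpleNode parent;
        final List<SimpleNode> children = new ArrayList<>();

        // Appending re-parents the child: it is removed from its old parent first.
        void appendChild(SimpleNode child) {
            if (child.parent != null) {
                child.parent.children.remove(child); // the remaining children shift left
            }
            child.parent = this;
            children.add(child);
        }
    }

    private static SimpleNode parentWith(int childCount) {
        SimpleNode parent = new SimpleNode();
        for (int i = 0; i < childCount; i++) {
            parent.appendChild(new SimpleNode());
        }
        return parent;
    }

    // Index-based loop: the index grows while the source list shrinks, so children are skipped.
    public static int leftBehindWithIndexLoop(int childCount) {
        SimpleNode source = parentWith(childCount);
        SimpleNode target = new SimpleNode();
        for (int i = 0; i < source.children.size(); i++) {
            target.appendChild(source.children.get(i));
        }
        return source.children.size();
    }

    // Drain loop: always take the first remaining child until the source is empty.
    public static int leftBehindWithWhileLoop(int childCount) {
        SimpleNode source = parentWith(childCount);
        SimpleNode target = new SimpleNode();
        while (!source.children.isEmpty()) {
            target.appendChild(source.children.get(0));
        }
        return source.children.size();
    }

    public static void main(String[] args) {
        System.out.println("index loop leaves behind: " + leftBehindWithIndexLoop(4));
        System.out.println("while loop leaves behind: " + leftBehindWithWhileLoop(4));
    }
}
```

With four children the index loop moves only the first and the third one and leaves two behind, while the drain loop moves all of them.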

Upvotes: 1
