Matching repeating HTML pattern using Java regex

Question

Someone may have asked this before, but I couldn't find a solution, so I'm posting the question.

I need to parse the below HTML string to find id, time and subject for each item:


  12:01 PM
  [This is dummy Subject1] This is some dummy strings after subject


  12:01 PM
  [This is dummy Subject2] This is some dummy strings after subject


  12:01 PM
  [This is dummy Subject3] This is some dummy strings after subject

The output needs to be like: id|time|subject.

Andy Lowry · Accepted Answer

Your title says "using regex," but that is probably a bad approach. Even if you got something to work, it would likely be fragile: seemingly insignificant changes to the input (perfectly legal from an HTML point of view) would cause your code to fail. Handling all the syntactic complexities of XML (and hence of HTML) could be a nightmare. For example, attribute values can be quoted with single or double quotes; character entities (like "&quot;") can appear in attribute values or element text; element text can appear in CDATA form; and so on.
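To make the fragility concrete, here is a small sketch. The pattern and the class name FragileRegexDemo are hypothetical, as are the sample tags and the id value "a1"; the pattern is just the kind of expression one might write first, assuming double-quoted attributes in a fixed order. Two equally legal variations of the same tag already defeat it.

```java
import java.util.regex.Pattern;

public class FragileRegexDemo {

  // Hypothetical first-attempt pattern: assumes double quotes and
  // that "class" always comes before "id".
  static final Pattern LIST_DIV =
      Pattern.compile("<div class=\"list\" id=\"([^\"]*)\">");

  /** Returns true if the pattern finds a list div in the given markup. */
  public static boolean matches(String html) {
    return LIST_DIV.matcher(html).find();
  }

  public static void main(String[] args) {
    // The form the pattern was written for.
    System.out.println(matches("<div class=\"list\" id=\"a1\">"));  // true
    // Same element, single-quoted attributes.
    System.out.println(matches("<div class='list' id='a1'>"));      // false
    // Same element, attributes in the other order.
    System.out.println(matches("<div id=\"a1\" class=\"list\">"));  // false
  }
}
```

Each variation can be patched into the pattern, but the expression grows with every case, while a parser handles all of them for free.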

A much more reliable approach is to use one of the XML parsing solutions available in the javax.xml package. You have several choices, and any of them can be used as the basis for a robust solution to your problem.

One simple approach is to use a combination of org.w3c.dom.Document and javax.xml.xpath.XPathExpression. With the former, your XML is parsed and you end up with its full contents in a navigable object of type Document. You could navigate that directly to find the data you're looking for, but you can also use XPathExpression objects to do the searching for you.

This approach may not be practical if your input document can be very large. In that case you might look into the org.xml.sax package, which provides a streaming XML parser. You won't be able to use XPath expressions with it, but the handler you'd have to write should be quite easy for your problem.
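For the streaming route, a handler could look roughly like the sketch below. It assumes each item is a div with class "list" and an id attribute, containing child divs with classes "time" and "subject", and that the input is well-formed XML. The class names SaxListExtractor and ListHandler, the extract method, and the sample id "a1" are made up for illustration. Unlike an XPath string value, this version trims surrounding whitespace from the text.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

public class SaxListExtractor {

  /** Builds one "id|time|subject" row per element with class "list". */
  static class ListHandler extends DefaultHandler {
    final List<String> rows = new ArrayList<>();
    // Class attribute of every currently open element ("" if it has none),
    // so endElement knows which kind of element is closing.
    private final Deque<String> open = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private String id, time, subject;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
      String cls = atts.getValue("class");
      open.push(cls == null ? "" : cls);
      if ("list".equals(cls)) {
        id = atts.getValue("id");
        time = "";
        subject = "";
      } else if ("time".equals(cls) || "subject".equals(cls)) {
        text.setLength(0); // start collecting this element's text
      }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
      text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
      switch (open.pop()) {
        case "time":    time = text.toString().trim(); break;
        case "subject": subject = text.toString().trim(); break;
        case "list":    rows.add(id + "|" + time + "|" + subject); break;
        default:        break;
      }
    }
  }

  /** Parses the given XML text and returns the collected rows. */
  public static List<String> extract(String xml) throws Exception {
    ListHandler handler = new ListHandler();
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new InputSource(new StringReader(xml)), handler);
    return handler.rows;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<html>"
        + "<div class=\"list\" id=\"a1\">"
        + "<div class=\"time\">12:01 PM</div>"
        + "<div class=\"subject\">[This is dummy Subject1] This is some dummy strings after subject</div>"
        + "</div>"
        + "</html>";
    extract(xml).forEach(System.out::println);
  }
}
```

Only the current item's fields are held in memory, so this scales to large inputs; for a real file you would pass an InputStream to the parser instead of a string.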

Here's code using the Document / XPathExpression approach. If you save your HTML snippet (with the incorrect tags fixed in a few places and the whole thing wrapped in a single root element) in a file named "foo.html" alongside the Test.class file, you should be able to run it successfully.

package test;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;



public class Test {

  public static void main(String[] argv) throws XPathExpressionException, SAXException, IOException, ParserConfigurationException {
    XPathFactory fac = XPathFactory.newInstance();
    XPathExpression idDivExpr = fac.newXPath().compile("//div[@class='list']");
    XPathExpression timeExpr = fac.newXPath().compile("div[@class='time']");
    XPathExpression subjExpr = fac.newXPath().compile("div[@class='subject']");
    InputStream in = Test.class.getResourceAsStream("foo.html");
    Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
    NodeList nl = (NodeList) idDivExpr.evaluate(doc, XPathConstants.NODESET);
    for (int i = 0; i < nl.getLength(); i++) {
      Element elt = (Element) nl.item(i);
      System.out.printf("%s|%s|%s\n",
          elt.getAttribute("id"),
          timeExpr.evaluate(elt),
          subjExpr.evaluate(elt));
    }
  }
}
