Satish Patel
Satish Patel

Reputation: 1844

How do I parse a string to HTML DOM in java

My java program is storing the content of web page in the string sb and I want to parse the string to HTML DOM. How do I do that?

import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Scraper {
    public static void main(String[] args) throws IOException, SAXException {
        URL u;
        try {
            u = new URL("https://twitter.com/ssjsatish");
            URLConnection cn = u.openConnection();
            System.out.println("content type:  "+cn.getContentType());
            InputStream is = cn.getInputStream();
            long l = cn.getContentLengthLong();
            StringBuilder sb = new StringBuilder();
            if (l!=0) {
                int c;
                while ((c = is.read()) != -1) {
                   sb.append((char)c);
                }
                is.close();
                System.out.println(sb);
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                InputSource i = new InputSource();
                i.setCharacterStream(new StringReader(sb.toString()));
                Document doc = db.parse(i);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
    }
}

Upvotes: 1

Views: 3194

Answers (1)

jjm
jjm

Reputation: 6198

You don't want to use an XML parser to parse HTML, because not all valid HTML is valid XML. I would recommend using a library specifically designed to parse "real-world" HTML, for example I have had good results with jsoup, but there are others. Another advantage of using this sort of library is that their APIs are designed with Web Scraping in mind, and provide much simpler ways of accessing data in the HTML document.

Upvotes: 3

Related Questions