Reputation: 1844
My Java program stores the content of a web page in the StringBuilder sb,
and I want to parse that string into an HTML DOM. How do I do that?
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.net.*;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class Scraper {
    public static void main(String[] args) throws IOException, SAXException {
        URL u;
        try {
            u = new URL("https://twitter.com/ssjsatish");
            URLConnection cn = u.openConnection();
            System.out.println("content type: " + cn.getContentType());
            InputStream is = cn.getInputStream();
            long l = cn.getContentLengthLong();
            StringBuilder sb = new StringBuilder();
            if (l != 0) {
                int c;
                while ((c = is.read()) != -1) {
                    sb.append((char) c);
                }
                is.close();
                System.out.println(sb);
                DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
                InputSource i = new InputSource();
                i.setCharacterStream(new StringReader(sb.toString()));
                Document doc = db.parse(i);
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }
    }
}
Upvotes: 1
Views: 3194
Reputation: 6198
You don't want to use an XML parser to parse HTML, because not all valid HTML is valid XML. I would recommend a library designed specifically to parse "real-world" HTML. I have had good results with jsoup, but there are others. Another advantage of this sort of library is that its API is designed with web scraping in mind and provides much simpler ways of accessing data in the HTML document.
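To illustrate why the XML parser in the question fails, here is a minimal sketch using only the JDK. The class name `XmlVsHtmlDemo`, the helper `parsesAsXml`, and the two sample strings are made up for this example and are not from the question. It feeds the same `DocumentBuilder` a snippet with an unclosed `<br>` (fine in HTML, not well-formed XML) and then the self-closed variant.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names here are illustrative only.
class XmlVsHtmlDemo {

    // Returns true if the markup is well-formed XML,
    // false if the XML parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            db.setErrorHandler(new DefaultHandler());
            db.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown when the input is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unclosed <br>: acceptable HTML, but not well-formed XML.
        String html = "<html><body><p>Hello<br>world</p></body></html>";
        // Same content with a self-closed <br/>: well-formed XML.
        String xhtml = "<html><body><p>Hello<br/>world</p></body></html>";

        System.out.println("unclosed <br> parses as XML: " + parsesAsXml(html));
        System.out.println("self-closed <br/> parses as XML: " + parsesAsXml(xhtml));
    }
}
```

With jsoup on the classpath, the corresponding call would be `Jsoup.parse(sb.toString())`, which returns an `org.jsoup.nodes.Document` that can be queried with CSS selectors via `select(...)`. jsoup is a third-party dependency, so it is left out of the runnable sketch above.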
Upvotes: 3