Vahid Hashemi

Reputation: 5240

How to get only the <html> data </html> from the internet using Java?

I'm using the following code to retrieve data from the internet, but I also get the HTTP headers, which are useless to me.

URL url = new URL(webURL);
URLConnection conn = url.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String inputLine;

while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}
in.close();

How can I get only the HTML data, without any headers or anything else?

Regards

Upvotes: 1

Views: 813

Answers (4)

KV Prajapati

Reputation: 94645

You are retrieving the correct data using URLConnection. However, if you want to read or access a particular HTML tag, you have to use an HTML parser. I suggest jsoup.

Example:

org.jsoup.nodes.Document doc = org.jsoup.Jsoup.connect("http://your_url/").get();
org.jsoup.nodes.Element head = doc.head(); // <head> tag content
org.jsoup.nodes.Element body = doc.body(); // <body> tag content

System.out.println(doc.text()); // Only text inside the <html>

Upvotes: 1

Fantasy Shao

Reputation: 613

Do you mean you want to convert the HTML into text? If so, you can use org.htmlparser.*. Take a look at http://htmlparser.sourceforge.net/

Upvotes: 0

Wayne

Reputation: 60414

Retrieving and parsing a document using TagSoup:

Parser p = new Parser();
SAX2DOM sax2dom = new SAX2DOM();
URL url = new URL("http://stackoverflow.com");
p.setContentHandler(sax2dom);
p.parse(new InputSource(new InputStreamReader(url.openStream())));
org.w3c.dom.Node doc = sax2dom.getDOM();

The TagSoup and SAX2DOM packages are:

import org.ccil.cowan.tagsoup.Parser;
import org.apache.xalan.xsltc.trax.SAX2DOM;

Writing the contents to System.out:

TransformerFactory tFact = TransformerFactory.newInstance();
Transformer transformer = tFact.newTransformer();
Source source = new DOMSource(doc);
Result result = new StreamResult(System.out);
transformer.transform(source, result);

These come from javax.xml.transform.*, except DOMSource (javax.xml.transform.dom) and StreamResult (javax.xml.transform.stream).

Upvotes: 1

Shraddha

Reputation: 2335

You can parse the complete response, search for the <html> tags, and keep only the data between them.
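
A minimal sketch of that string-search idea, using only the standard library. The class name HtmlSlice and the method extractHtml are hypothetical names chosen for illustration, and the sample response in main is made up. It assumes the whole response has already been read into a single String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlSlice {
    // Matches from the first "<html" tag to the last "</html>" tag.
    // CASE_INSENSITIVE accepts <HTML>, DOTALL lets '.' span line breaks,
    // and the greedy ".*" extends the match to the final closing tag.
    private static final Pattern HTML_ELEMENT =
            Pattern.compile("<html\\b.*</html\\s*>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Hypothetical helper: returns the <html>...</html> part of the text,
    // or null when no such element is found.
    static String extractHtml(String text) {
        Matcher m = HTML_ELEMENT.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up sample: header-like lines followed by a small page.
        String raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n"
                + "<!DOCTYPE html>\n<html><body>Hello</body></html>\n";
        System.out.println(extractHtml(raw));
    }
}
```

Plain string matching like this can be fooled by tags that appear inside comments or scripts, so a real HTML parser is probably the more robust choice for anything beyond simple pages.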

Upvotes: 0
