GreenGodot

Reputation: 6753

Java Http Request that only returns certain elements I want

Is there a method in Java to make an HTTP request to a webpage where the response will contain only the specific elements I want instead of the whole document?

For example, if I were to request a <div> called "example", the response would be only that element and not the rest of the fluff on the page, which I do not need.

Most methods I have looked at involve getting an entire HTML page and then parsing it. I want to look at the page, pluck out just the div I want, and have only that as the response. The pages I am dealing with contain a lot of advert content that I want to ignore.

Upvotes: 1

Views: 1536

Answers (3)

Parker

Reputation: 7494

I understand what you want to do; you've just asked slightly the wrong question. HTTP has nothing to do with the content of the page. It is simply the protocol that governs server requests and responses (GET, PUT, POST, HEAD, OPTIONS).

The problem you are describing can only be handled after retrieval of the content is complete. You need to work with the Document Object Model (DOM), the tree representation used for HTML and XML documents. This means that you will need to familiarize yourself with DOM, and maybe XPath and XSLT as well.

The functionality you are asking for can be implemented in many ways, but it generally boils down to a sequence of non-trivial operations:

  1. Retrieve page content for URL (including negotiating encodings, HTTP redirects and protocol changes).
  2. Clean up non-well-formed content (e.g., unclosed or improperly nested tags), for instance with JTidy.
  3. Parse page content into DOM.
  4. Traverse DOM to find the nodes you are interested in (e.g., via DOM or XPath).
  5. Build output DOM (e.g. via org.w3c.dom classes).
  6. Write output DOM to file (combination of java.io and org.w3c.dom).
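
As a rough illustration of steps 3 to 6, here is a minimal sketch that uses only the JDK's built-in XML classes. It assumes the markup has already been cleaned up into well-formed XHTML without a DOCTYPE (step 2), and that the div is identified by its id attribute. The class name DivExtractor and the method extractDivById are made up for this example, and the result is serialized to a string rather than written to a file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DivExtractor {

    // Hypothetical helper: returns the markup of <div id="..."> or null if there is none.
    static String extractDivById(String xhtml, String id) throws Exception {
        // Step 3: parse the (already well-formed) markup into a DOM.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));

        // Step 4: walk the div elements and pick the one with the matching id.
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (id.equals(div.getAttribute("id"))) {
                // Steps 5 and 6: serialize just that subtree (to a string here, not a file).
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                StringWriter out = new StringWriter();
                transformer.transform(new DOMSource(div), new StreamResult(out));
                return out.toString();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample page: one advert div and the div we actually want.
        String page = "<html><body><div id=\"ads\">advert</div>"
                + "<div id=\"example\"><p>Hello</p></div></body></html>";
        System.out.println(extractDivById(page, "example"));
    }
}
```

Real-world HTML is rarely well-formed, so the strict XML parser used here would likely reject it unless the cleanup step has been done first.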

While it is possible to implement this from scratch, several open source projects already provide this functionality. Try something like jsoup (Java HTML Parser).

Upvotes: 1

Manindar

Reputation: 998

No, it's not possible. An HTTP GET or POST call returns the complete web page, not just a portion of it.

Upvotes: 1

Tim

Reputation: 43314

That's not possible. The way the web works is that you send an HTTP GET request for a page, and the server returns the entire page. What you do with it (parsing, etc.) is up to you, but you have no influence over how HTTP itself works.

This could, however, be realised if you host a webpage using a custom server/API that you implemented yourself. You could send a request with certain parameters specifying what you need, and the server could parse the HTML page on its side and return only that part.
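
A minimal sketch of what such a custom server might look like, using the HTTP server that ships with the JDK (com.sun.net.httpserver). Everything specific here is an assumption made for illustration: the class name FragmentServer, the /fragment endpoint, the id query parameter, and the hard-coded PAGE, which stands in for a page the server would normally fetch and clean up itself. The page is also assumed to be well-formed XHTML.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class FragmentServer {

    // Hypothetical stand-in for the upstream page.
    static final String PAGE = "<html><body><div id=\"ads\">advert</div>"
            + "<div id=\"example\"><p>Hello</p></div></body></html>";

    // Returns the markup of the div named by a query string such as "id=example", or null.
    static String respond(String query) throws Exception {
        if (query == null || !query.startsWith("id=")) {
            return null;
        }
        String id = query.substring("id=".length());
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(PAGE)));
        NodeList divs = doc.getElementsByTagName("div");
        for (int i = 0; i < divs.getLength(); i++) {
            Element div = (Element) divs.item(i);
            if (id.equals(div.getAttribute("id"))) {
                // Serialize only the matching subtree.
                Transformer transformer = TransformerFactory.newInstance().newTransformer();
                transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                StringWriter out = new StringWriter();
                transformer.transform(new DOMSource(div), new StreamResult(out));
                return out.toString();
            }
        }
        return null;
    }

    // Starts the server; the parsing happens server side, so the client only receives the div.
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/fragment", exchange -> {
            String body;
            try {
                body = respond(exchange.getRequestURI().getQuery());
            } catch (Exception e) {
                body = null; // treat parse failures like a missing div in this sketch
            }
            int status = (body != null) ? 200 : 404;
            byte[] bytes = (body != null ? body : "div not found").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
        return server;
    }
}
```

With this sketch, calling FragmentServer.start(8080) and then requesting /fragment?id=example on that port should return only the requested div. The client still receives a complete HTTP response; it is just that the server chose to put only the fragment in it.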

Upvotes: 2
