Slowcoder

Reputation: 2120

Image extraction from a webpage in Java

I have just started working on a content extraction project. First, I am trying to extract the image URLs from a webpage. In some cases, the "src" attribute of an "img" element holds a relative URL, but I need the complete (absolute) URL.

I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library that makes this easy?

Upvotes: 0

Views: 906

Answers (1)

radkovo

Reputation: 888

If you just need to get the complete URL from a relative one, the solution is simple in Java:

URL pageUrl = base_url_of_the_html_page; // placeholder: the URL the HTML was fetched from
String src = src_attribute_value; // placeholder: relative or absolute URL
URL imgUrl = new URL(pageUrl, src); // resolves src against pageUrl

The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document head may specify a different base URL (this is not used very frequently).
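The resolution rule described above can be sketched as a small self-contained program. This is only an illustration under a few assumptions: the class name ImageUrlResolver, the method resolve and the example.com URLs are made up for the example, and it uses java.net.URI.resolve in place of the new URL(pageUrl, src) constructor shown earlier, since that constructor is deprecated in recent JDKs. The baseHref parameter stands for the href of a <base> tag, if the document has one.

```java
import java.net.URI;

class ImageUrlResolver {

    // Resolves an img "src" value against the page URL.
    // baseHref is the href of a <base> tag, or null/empty if the page has none
    // (hypothetical parameter; extracting it from the HTML is not shown here).
    static String resolve(String pageUrl, String baseHref, String src) {
        URI base = URI.create(pageUrl);
        if (baseHref != null && !baseHref.isEmpty()) {
            // A <base href> may itself be relative to the page URL.
            base = base.resolve(baseHref);
        }
        // An absolute src is returned unchanged; a relative one is resolved against base.
        return base.resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/articles/post.html";
        // prints http://example.com/articles/images/cat.png
        System.out.println(resolve(page, null, "images/cat.png"));
        // prints http://example.com/logo.png
        System.out.println(resolve(page, null, "/logo.png"));
        // prints http://cdn.example.org/a.jpg
        System.out.println(resolve(page, null, "http://cdn.example.org/a.jpg"));
        // prints http://static.example.com/assets/cat.png
        System.out.println(resolve(page, "http://static.example.com/assets/", "cat.png"));
    }
}
```

Note that URI.create throws IllegalArgumentException for strings that are not valid URIs (for example, a src containing unescaped spaces), so real-world attribute values may need cleaning or a try/catch around the call.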

You may use Jsoup or just a DOM parser to obtain the src attribute values and to find the <base> tag, if one is present.

Upvotes: 1
