Reputation: 2120
I have just started working on a content extraction project. First, I am trying to extract the image URLs from a webpage. In some cases, the "src" attribute of an "img" element contains a relative URL, but I need the complete (absolute) URL.
I was looking for a Java library to achieve this and thought Jsoup would be useful. Is there any other library that makes this easy?
Upvotes: 0
Views: 906
Reputation: 888
If you just need to get the complete URL from a relative one, the solution is simple in Java:
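// Example values only; substitute the real page URL and src attribute value.
URL pageUrl = new URL("http://example.com/articles/page.html"); // base URL of the HTML page
String src = "images/photo.png"; // src attribute value: relative or absolute URL
URL imgUrl = new URL(pageUrl, src); // resolves src against pageUrl; throws MalformedURLException on bad input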
The base URL of the HTML page is usually just the URL you obtained the HTML code from. However, a <base> tag in the document header may specify a different base URL (this is not used very frequently).
You may use Jsoup or just a DOM parser to obtain the src attribute values and to find the <base> tag, if there is one.
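To make the resolution rules above concrete, here is a minimal sketch that uses only the standard library (java.net.URI instead of java.net.URL). The class name ImageUrlResolver, the resolve helper, and all example URLs are made up for illustration. It assumes the src values and the optional <base> href have already been extracted by whatever parser you use.

```java
import java.net.URI;

public class ImageUrlResolver {

    // Hypothetical helper: resolves an img src against the page URL,
    // honouring an optional <base href> value (pass null if the page has none).
    // Note: URI.create throws IllegalArgumentException for strings that are not
    // valid URIs (for example, a src containing unescaped spaces).
    static String resolve(String pageUrl, String baseHref, String src) {
        URI base = URI.create(pageUrl);
        if (baseHref != null && !baseHref.isEmpty()) {
            // The base href may itself be relative to the page URL.
            base = base.resolve(baseHref);
        }
        // An absolute src is returned unchanged; a relative one is resolved.
        return base.resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/articles/page.html"; // example value
        System.out.println(resolve(page, null, "images/a.png"));
        System.out.println(resolve(page, null, "/img/b.png"));
        System.out.println(resolve(page, null, "http://cdn.example.org/c.png"));
        System.out.println(resolve(page, "http://static.example.com/assets/", "d.png"));
    }
}
```

If you do go with Jsoup, passing the page URL as the base URI when parsing and then reading the attribute through absUrl("src") should give the same kind of result without a hand-written helper.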
Upvotes: 1