Joy

Reputation: 4483

Extracting relative links from a web page in proper format using Jsoup

I have extracted the outgoing links of a web page, and I want to parse each linked page in turn using Jsoup. The problem is that the links are relative, of the form ../../../pincode/india/andaman-and-nicobar-islands/, and I cannot parse them in this form. So I converted them to absolute URLs using link.attr("abs:href"), following another Stack Overflow post.

The URL of the first page I parsed is http://www.mapsofindia.com/pincode/india/, and the absolute URLs I get back are of the form http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/. I cannot parse these any further with Jsoup. When I execute the following statement:

Jsoup.parse("http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/");

it gives an HTTP 400 error, i.e. Bad Request. So I think there is something wrong with the URLs. Can anyone help me get the URLs in a proper form so that I can parse them further? Thank you.

Upvotes: 0

Views: 519

Answers (1)

ollo

Reputation: 25350

Please test these two things:

  1. Try using link.absUrl("href") instead of link.attr("abs:href").
  2. Check the base URI (by calling baseUri() on your element or document).

By the way, you are better off using the connect() method for this, since it fetches and parses the page at the given URL:

Document doc = Jsoup.connect("http://<your url here>").get();
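
If the absolute URLs still contain a leading ".." segment, as in the example from the question, one possible workaround is to clean them up before connecting. The following is only a minimal sketch using java.net.URI from the standard library. UrlCleaner and cleanUrl are hypothetical names, not part of Jsoup, and the code assumes an absolute http or https URL with a host:

```java
import java.net.URI;

// Hypothetical helper, not part of Jsoup: removes dot segments from an absolute URL.
final class UrlCleaner {

    static String cleanUrl(String url) {
        // normalize() collapses "a/../b" style segments, but it keeps any ".."
        // segments left at the start of the path.
        URI uri = URI.create(url).normalize();

        String path = uri.getRawPath();
        if (path == null) {
            path = "";
        }
        // Drop leading "/.." segments, since there is nothing above the site root.
        while (path.startsWith("/../") || path.equals("/..")) {
            path = path.substring(3);
        }
        if (path.isEmpty()) {
            path = "/";
        }

        // Rebuild the URL from its raw parts so existing percent-escapes stay untouched.
        StringBuilder cleaned = new StringBuilder();
        cleaned.append(uri.getScheme()).append("://").append(uri.getRawAuthority()).append(path);
        if (uri.getRawQuery() != null) {
            cleaned.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            cleaned.append('#').append(uri.getRawFragment());
        }
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // URL taken from the question.
        System.out.println(cleanUrl(
                "http://www.mapsofindia.com/../pincode/india/andaman-and-nicobar-islands/"));
    }
}
```

The cleaned string could then be passed to Jsoup.connect(...).get() as shown above. Whether the server accepts the resulting URL is something to verify against the actual site.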

Upvotes: 1
