Reputation: 7400
Using the Tika snippet below (the article object is my own), I've come across URLs that redirect to a final page, I believe through a jQuery.extend command.
URL articleURL = new URL(article.getLink());
stream = TikaInputStream.get(articleURL);
articleBytes = IOUtils.toByteArray(stream);
if (articleBytes.length == 0) {
    return null;
} else {
    article.setContentLength((long) articleBytes.length);
}
ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
parser.parse(new ByteArrayInputStream(articleBytes), new BoilerpipeContentHandler(textHandler), metadata, context);
Tika follows the redirect just fine, but I want to know what the final URL is. Is there any way to get the actual, final URL from Tika?
An example URL that has a redirect in it is:
Upvotes: 1
Views: 374
Reputation: 7400
Based on this answer: https://stackoverflow.com/a/5270162/4471711
I used the following code:
URLConnection con = new URL(article.getLink()).openConnection();
con.connect();
stream = TikaInputStream.get(con.getInputStream());
articleBytes = IOUtils.toByteArray(stream);
article.setLink(con.getURL().toExternalForm());
With this, con.getURL().toExternalForm() returned the new (redirected) URL.
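For anyone who wants to try this without Tika, here is a minimal, self-contained sketch of the same idea using only the JDK. The class name FinalUrlResolver, the Fetched holder and the fetch method are hypothetical names made up for this illustration, and readAllBytes assumes Java 9 or later. It only covers HTTP-level redirects (3xx responses with a Location header): HttpURLConnection does not execute JavaScript, and it does not follow redirects that switch protocol (for example http to https).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper for illustration; not part of Tika or the original code.
public class FinalUrlResolver {

    /** Holds the downloaded bytes together with the URL they were finally served from. */
    public static final class Fetched {
        public final String finalUrl;
        public final byte[] body;

        Fetched(String finalUrl, byte[] body) {
            this.finalUrl = finalUrl;
            this.body = body;
        }
    }

    public static Fetched fetch(String link) throws IOException {
        URLConnection con = URI.create(link).toURL().openConnection();
        if (con instanceof HttpURLConnection) {
            // Following redirects is normally the default; set it explicitly to be safe.
            ((HttpURLConnection) con).setInstanceFollowRedirects(true);
        }
        // Opening the input stream sends the request and follows any redirects.
        try (InputStream in = con.getInputStream()) {
            byte[] body = in.readAllBytes(); // assumes Java 9+
            // Once the response has been read, getURL() reflects the redirected location.
            return new Fetched(con.getURL().toExternalForm(), body);
        }
    }
}
```

The returned body could then be handed to the parser as new ByteArrayInputStream(fetched.body), and fetched.finalUrl stored on the article, in the same way as in the snippet above.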
Upvotes: 1