Brooks

Reputation: 7400

Apache Tika - How to access a redirect URL

Using the snippet below, which reads a page with Tika (the article object is my own), I've come across URLs that redirect to the final page, I believe through a jQuery.extend command.

URL articleURL = new URL(article.getLink());
stream = TikaInputStream.get(articleURL);
articleBytes = IOUtils.toByteArray(stream);
if (articleBytes.length == 0) {
    return null;
} else {
    article.setContentLength((long) articleBytes.length);
}

ContentHandler textHandler = new BodyContentHandler();
Metadata metadata = new Metadata();
AutoDetectParser parser = new AutoDetectParser();
ParseContext context = new ParseContext();

parser.parse(new ByteArrayInputStream(articleBytes), new BoilerpipeContentHandler(textHandler), metadata, context);

Tika follows the redirect just fine; however, I want to know what the final URL is. Is there any way to get the actual, final URL from Tika?

An example URL that has a redirect in it is:

http://sbs.feedsportal.com/c/34692/f/637529/s/4d7e2cd0/sc/14/l/0L0Ssbs0N0Bau0Cnews0Carticle0C20A160C0A20C110Cscientists0Emaking0Ezika0Edetection0Ekits/story01.htm--2016-02-27

Upvotes: 1

Views: 374

Answers (1)

Brooks

Reputation: 7400

Based on this answer: https://stackoverflow.com/a/5270162/4471711

I used the following code:

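// Open the connection explicitly so the post-redirect URL can be read back from it.
URLConnection con = new URL(article.getLink()).openConnection();
con.connect();
// Requesting the input stream is what makes the connection follow any HTTP redirects.
stream = TikaInputStream.get(con.getInputStream());
articleBytes = IOUtils.toByteArray(stream);
// After the read, getURL() reflects the final location rather than the original link.
article.setLink(con.getURL().toExternalForm());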

con.getURL().toExternalForm() returned the new (redirected) URL.
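
For anyone who wants the same idea without the Tika and article pieces, here is a minimal, self-contained sketch using only the JDK. The class name FinalUrlResolver, the Fetched holder, and the fetch method are hypothetical names made up for illustration; they are not part of Tika or of my real code. It assumes Java 9+ (for readAllBytes) and a plain HTTP redirect.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

public class FinalUrlResolver {

    /** Hypothetical holder: the fetched bytes plus the URL they were actually served from. */
    public static final class Fetched {
        public final String finalUrl;
        public final byte[] body;

        Fetched(String finalUrl, byte[] body) {
            this.finalUrl = finalUrl;
            this.body = body;
        }
    }

    /** Fetches the link, following HTTP redirects, and reports where the content came from. */
    public static Fetched fetch(String link) throws IOException {
        URLConnection con = URI.create(link).toURL().openConnection();
        if (con instanceof HttpURLConnection) {
            // Redirect following is on by default; set it explicitly to make the intent clear.
            ((HttpURLConnection) con).setInstanceFollowRedirects(true);
        }
        // try-with-resources closes the stream even if reading fails.
        try (InputStream in = con.getInputStream()) {
            byte[] body = in.readAllBytes();
            // Once the response has been read, getURL() returns the post-redirect URL.
            return new Fetched(con.getURL().toExternalForm(), body);
        }
    }
}
```

The returned bytes can then be wrapped in a ByteArrayInputStream and handed to the parser as in the question. Two caveats: HttpURLConnection does not follow redirects that switch protocol (for example http to https), and it only sees HTTP-level redirects, so a redirect performed by JavaScript in the page would not show up in getURL().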

Upvotes: 1
