Touseef Ahmed

Reputation: 3

Not able to download a specific URL in Java

I am writing the following program to download a URL using Apache Commons IO, and I am getting a read-timeout exception:

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at java.net.SocketInputStream.read(Unknown Source)
at sun.security.ssl.InputRecord.readFully(Unknown Source)
at sun.security.ssl.InputRecord.read(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readRecord(Unknown Source)
at sun.security.ssl.SSLSocketImpl.readDataRecord(Unknown Source)
at sun.security.ssl.AppInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
at java.net.URL.openStream(Unknown Source)
at org.apache.commons.io.FileUtils.copyURLToFile(FileUtils.java:1456)
at com.touseef.stock.FileDownload.main(FileDownload.java:23)

Program

    String urlStr = "https://www.nseindia.com/";
    File file = new File("C:\\User\\WorkSpace\\Output.txt");
    URL url;
    try {
        url = new URL(urlStr);
        FileUtils.copyURLToFile(url, file);
        System.out.println("Successfully Completed.");
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

Other sites download fine. Please suggest a fix. I am using the commons-io-2.6 jar.
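For what it's worth, commons-io also has a four-argument `FileUtils.copyURLToFile(url, file, connectTimeoutMillis, readTimeoutMillis)` overload for raising the timeouts. The same knobs exist on plain `java.net.URLConnection`; a minimal sketch (copying a local `file://` URL so it runs without network access, but the same calls apply to http(s) URLs):

```java
import java.io.File;
import java.io.InputStream;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

public class UrlDownloadSketch {
    public static void main(String[] args) throws Exception {
        // Use a local file so the example works offline; the same calls
        // apply to http(s) URLs.
        File src = File.createTempFile("src", ".txt");
        Files.write(src.toPath(), "hello".getBytes("UTF-8"));
        File dest = File.createTempFile("dest", ".txt");

        URLConnection conn = src.toURI().toURL().openConnection();
        conn.setConnectTimeout(10_000); // fail fast instead of hanging on connect
        conn.setReadTimeout(10_000);    // raise the read timeout that was tripping
        try (InputStream in = conn.getInputStream()) {
            Files.copy(in, dest.toPath(), StandardCopyOption.REPLACE_EXISTING);
        }
        System.out.println(new String(Files.readAllBytes(dest.toPath()), "UTF-8"));
    }
}
```

Note that raising the timeout alone may not help if the server deliberately never answers non-browser clients, which is what the accepted answer below-the-fold situations usually turn out to be.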

Upvotes: 0

Views: 488

Answers (1)

Robert

Reputation: 42585

It seems this site is protected by a web gateway (a DoS-protection service such as Akamai?). Clients appear to be fingerprinted by their TLS connection and HTTP request headers, and only valid web browsers can connect to the site.

The following code uses Apache Commons HttpClient 4.5 and works, at least at the moment:

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    String urlStr = "https://www.nseindia.com/";
    File file = new File("C:\\User\\WorkSpace\\Output.txt");
    String userAgent = "-";

    CloseableHttpClient httpclient = HttpClients.custom().setUserAgent(userAgent).build();
    HttpGet httpget = new HttpGet(urlStr);
    httpget.addHeader("Accept-Language", "en-US");
    httpget.addHeader("Cookie", "");

    System.out.println("Executing request " + httpget.getRequestLine());
    try (CloseableHttpResponse response = httpclient.execute(httpget)) {
        System.out.println("----------------------------------------");
        System.out.println(response.getStatusLine());
        String body = EntityUtils.toString(response.getEntity());
        System.out.println(body);
        // Files.writeString requires Java 11; on older JVMs use
        // Files.write(file.toPath(), body.getBytes())
        Files.writeString(file.toPath(), body);
    }

A request that works from within Firefox, for example, does not work from Java (because the TLS connection, with its protocols and ciphers, is different). I tried a few combinations with Apache Commons HttpClient, but it also fails (even though the same request works from Fiddler).
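To see what the Java TLS stack would offer in its ClientHello (and compare it against a browser's, e.g. in Wireshark), one option is to dump the JVM's default SSL parameters; running with `-Djavax.net.debug=ssl:handshake` prints the full handshake as well. A minimal sketch:

```java
import java.util.Arrays;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLParameters;

public class TlsDefaults {
    public static void main(String[] args) throws Exception {
        // Dump the protocol versions and cipher suites the default
        // SSLContext offers to servers during the handshake.
        SSLParameters params = SSLContext.getDefault().getDefaultSSLParameters();
        System.out.println("Protocols: " + Arrays.toString(params.getProtocols()));
        System.out.println("Cipher suites: " + params.getCipherSuites().length);
    }
}
```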

Hence, using this web site from within Java is extremely difficult, and even though the code above works at the moment, the protection system can be adapted at any time so that it stops working again.

I would assume that such a site provides an API dedicated to programmatic usage. Contact them and ask; that is the only advice I can give you.

Upvotes: 1
