ArthurMorgan

Reputation: 75

Why am I getting 403 status code in Java after a while?

When I check status codes for a list of sites, I start getting a 403 response code after a while. The first time the code runs, every site sends back data, but once the code repeats itself via a Timer, one webpage starts returning a 403 response code. Here is my code.

public class Main {

    public static void checkSites() {
        Timer ifSee403 = new Timer();

        try {
            File links = new File("./linkler.txt");
            Scanner scan = new Scanner(links);
            ArrayList<String> list = new ArrayList<>();
            while(scan.hasNext()) {
                list.add(scan.nextLine());
            }
            File linkStatus = new File("LinkStatus.txt");
            if(!linkStatus.exists()){
                linkStatus.createNewFile();
            }else{
                System.out.println("File already exists");
            }
            BufferedWriter writer = new BufferedWriter(new FileWriter(linkStatus));
            for(String link : list) {
                try {
                    if(!link.startsWith("http")) {
                        link = "http://"+link;
                    }
                    URL url = new URL(link);
                    HttpURLConnection.setFollowRedirects(true);
                    HttpURLConnection http = (HttpURLConnection)url.openConnection();
                    http.setRequestMethod("HEAD");
                    http.setConnectTimeout(5000);
                    http.setReadTimeout(8000);

                    int statusCode = http.getResponseCode();
                    if (statusCode == 200) {
                        ifSee403.wait(5000);
                        System.out.println("Hello, here we go again");
                    }
                    http.disconnect();
                    System.out.println(link + " " + statusCode);
                    writer.write(link + " " + statusCode);
                    writer.newLine();
                } catch (Exception e) {
                    writer.write(link + " " + e.getMessage());
                    writer.newLine();

                    System.out.println(link + " " +e.getMessage());
                }
            }
            try {
                writer.close();

            } catch (Exception e) {
                System.out.println(e.getMessage());
            }

            System.out.println("Finished.");

        } catch (Exception e) {
            System.out.println(e.getMessage());
        }



    }

    public static void main(String[] args) throws Exception {


        Timer myTimer = new Timer();

        TimerTask sendingRequest = new TimerTask() {
            public void run() {
                checkSites();
            }
        };
        myTimer.schedule(sendingRequest,0,150000);

    }
}

How can I solve this? Thanks

Edited comment:

  1. I've added http.disconnect(); to close the connection after checking the status code.

  2. Also I've added

    if (statusCode == 200) {
        ifSee403.wait(5000);
        System.out.println("Test message");
    }

But it didn't work; the program threw a "current thread is not owner" error. I need to fix this, change 200 to 403, call ifSee403.wait(5000), and then try the status code again.
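For reference, the "current thread is not owner" message comes from calling wait() on an object without holding its monitor (i.e. outside a synchronized block on that object); for a simple pause between requests, Thread.sleep is the usual tool. A minimal sketch of the fix, assuming the same 5-second delay as the snippet above (the statusCode value is hypothetical):

```java
public class RetryDelay {

    // Thread.sleep pauses the current thread and needs no monitor,
    // unlike Object.wait(), which throws IllegalMonitorStateException
    // when called outside a synchronized block on that object.
    static void pauseMillis(long ms) throws InterruptedException {
        Thread.sleep(ms);
    }

    public static void main(String[] args) throws InterruptedException {
        int statusCode = 403; // hypothetical value from a previous request

        if (statusCode == 403) {
            pauseMillis(5000);
            System.out.println("Retrying after delay");
        }
    }
}
```

The same pause works inside the Timer's run() method, since sleep only blocks the timer's own thread.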

Upvotes: 1

Views: 941

Answers (1)

Y2020-09

Reputation: 1

One "alternative" - by the way - to IP / Spoofing / Anonymizing would be to (instead) try "obeying" what the security-code is expecting you to do. If you are going to write a "scraper", and are aware there is a "bot detection" that doesn't like you debugging your code while you visit the site over and over and over - you should try using the HTML Download which I posted as an answer to the last question you asked.

If you download the HTML and save it (save it to a file - once an hour), and then write your HTML Parsing / Monitoring Code against the HTML contents of the file you have saved, you will (likely) be abiding by the security requirements of the web-site and still be able to check availability.

If you wish to continue to use JSoup, that A.P.I. has an option for receiving HTML as a String. So if you use the HTML Scrape Code I posted, and then write that HTML String to disk, you can feed that to JSoup as often as you like without causing the Bot Detection Security Checks to go off.

If you play by their rules once in a while, you can write your tester without much hassle.

import java.io.*;
import java.net.*;

...

// This line asks the "url" that you are trying to connect with for
// an instance of HttpURLConnection.  These two classes (URL and HttpURLConnection)
// are in the standard JDK Package java.net.*

HttpURLConnection con = (HttpURLConnection) url.openConnection();

// Tells the connection to use "GET" ... and to "pretend" that you are
// using a "Chrome" web-browser.  Note, the User-Agent sometimes means 
// something to the web-server, and sometimes is fully ignored.

con.setRequestMethod("GET");
con.setRequestProperty("User-Agent", "Chrome/61.0.3163.100");

// The classes InputStream, InputStreamReader, and BufferedReader
// are all JDK 1.0 package java.io.* classes.

InputStream      is = con.getInputStream();
BufferedReader   br = new BufferedReader(new InputStreamReader(is));
StringBuffer     sb = new StringBuffer();
String           s;

// This reads each line from the web-server.
while ((s = br.readLine()) != null) sb.append(s + "\n");

// This writes the results from the web-server to a file
// It is using classes java.io.File and java.io.FileWriter

File outF = new File("SavedSite.html");
outF.createNewFile();
FileWriter fw = new FileWriter(outF);
fw.write(sb.toString());
fw.close();

Again, this code is very basic stuff that doesn't use any special JAR Library Code at all. The next method uses the JSoup library (which you have explicitly requested - even though I don't use it, it is just fine!). This is the method "parse", which will parse the String you have just saved. You may load this HTML String from disk and send it to JSoup using:

Method Documentation: org.jsoup.Jsoup.parse(File in, String charsetName, String baseUri)

If you wish to invoke JSoup just pass it a java.io.File instance using the following:

File f = new File("SavedSite.html");
Document d = Jsoup.parse(f, "UTF-8", url.toString());

I do not think you need timers at all...

AGAIN: this matters if you are making lots of calls to the server. The purpose of this answer is to show you how to save the server's response to a file on disk, so you don't have to make lots of calls - JUST ONE! If you restrict your calls to the server to once per hour, then you will (likely, but not a guarantee) avoid getting a 403 Forbidden Bot Detection Problem.
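The once-per-hour idea above can be enforced with a simple file-age check before re-downloading. A hedged sketch, assuming the SavedSite.html file name from the earlier snippet and a one-hour threshold (both are just illustrative choices):

```java
import java.io.File;

public class CacheCheck {

    // Returns true when the cached copy is missing or older than maxAgeMillis,
    // i.e. when it is time to hit the server again.
    static boolean isStale(File cached, long maxAgeMillis) {
        return !cached.exists()
            || System.currentTimeMillis() - cached.lastModified() > maxAgeMillis;
    }

    public static void main(String[] args) {
        File saved = new File("SavedSite.html");
        long oneHour = 60L * 60L * 1000L;

        if (isStale(saved, oneHour)) {
            System.out.println("Cache stale - download a fresh copy");
            // ... run the HttpURLConnection download shown above ...
        } else {
            System.out.println("Cache fresh - parse the saved file instead");
        }
    }
}
```

This keeps the scraper honest even if the Timer fires more often than intended: the download only happens when the saved file is actually out of date.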

Upvotes: 1
