Why is my Jsoup Code not Returning the Correct Elements?

Question

I am working on an app in Android Studio and am having some trouble web-scraping with JSoup. I have successfully connected to the webpage and returned some basic elements to test the library, but now I cannot actually get the elements I need for my app.

I am trying to get a number of elements with the "data-at" attribute. The weird thing is, a few elements with the "data-at" attribute are returned, but not the ones I am looking for. For whatever reason my code is not extracting all of the elements that share the "data-at" attribute on the web page.

This is the URL of the webpage I am scraping: https://express.liatoyotaofcolonie.com/inventory?f=dealer.name%3ALia%20Toyota%20of%20Colonie&f=submodel%3ACamry&f=trim%3ALE&f=year%3A2020

The method containing the web-scraping code:

@Override
protected String doInBackground(Void... params) {
    String title = "";
    Document doc;
    Log.d(TAG, queryString.toString());
    try {
        doc = Jsoup.connect(queryString.toString()).get();
        Elements content = doc.select("[data-at]");
        for (Element e : content) {
            Log.d(TAG, e.text());
        }
    } catch (IOException e) {
        Log.e(TAG, e.toString());
    }
    return title;
}

The results in Logcat

The element I want to retrieve

One of the elements that is actually being retrieved

Antoniossss · Accepted Answer

This is because some of the content, including the element you are looking for, is created asynchronously by JavaScript and is not present in the initial DOM.

When you view the source of the page, you will notice that there are only 17 data-at occurrences, whereas running document.querySelectorAll("[data-at]") in the browser console returns 29 nodes.

What you get in Jsoup is the static content of the page (the initial DOM). You won't be able to fetch dynamically created content, because Jsoup does not run the required JavaScript.
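To make the difference concrete, here is a minimal sketch that uses only the Java standard library. The sample markup, the class name StaticDomDemo and the attribute values ("title", "filters", "price") are made up for illustration and are not taken from the real page. It counts the tags that carry data-at in the raw markup, which is roughly what "view source" or Jsoup has to work with. The third element only exists once a browser executes the script, so it is never counted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaticDomDemo {

    // Hypothetical page: two elements carry data-at in the static markup,
    // and a third one would only be created when a browser runs the script.
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<h1 data-at=\"title\">Inventory</h1>"
            + "<div data-at=\"filters\">Camry LE 2020</div>"
            + "<script>"
            + "var el = document.createElement('span');"
            + "el.setAttribute('data-at', 'price');"
            + "document.body.appendChild(el);"
            + "</script>"
            + "</body></html>";

    // Counts opening tags in raw markup that carry the given attribute.
    // A regex is enough for this illustration; it is not a general HTML parser.
    static int countTagsWithAttribute(String html, String attribute) {
        Pattern pattern = Pattern.compile(
                "<[a-zA-Z][^>]*\\s" + Pattern.quote(attribute) + "\\s*=");
        Matcher matcher = pattern.matcher(html);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // The script is never executed here, so only the two static tags are found.
        System.out.println(countTagsWithAttribute(SAMPLE_HTML, "data-at"));
    }
}
```

Jsoup's doc.select("[data-at]") behaves the same way on the downloaded HTML: it matches what the server sent, not what the scripts add later.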

To overcome this, you will have to either fetch and parse the required resources manually (e.g. trace which AJAX calls the browser makes) or use a headless browser setup. Selenium + headless Chrome should be enough.
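For the first option, a rough sketch could look like the code below. It assumes you have already found, in the Network tab of the browser's developer tools, the request that returns the inventory data. The endpoint URL is a placeholder, not the site's real API, and the class and method names (AjaxFetchSketch, fetch, readBody) are hypothetical. It uses HttpURLConnection, which is available on Android as well as on desktop Java.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class AjaxFetchSketch {

    // Reads the whole stream into a UTF-8 string.
    static String readBody(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Requests the endpoint directly, the way the page's own script would.
    static String fetch(String endpoint) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(endpoint).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(10_000);
        conn.setReadTimeout(10_000);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                return readBody(in);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL (assumption): replace it with the request you traced.
        String body = fetch("https://example.com/api/inventory?year=2020");
        System.out.println(body);
    }
}
```

On Android, call fetch from a background thread, as you already do in doInBackground, and parse the returned body with whatever JSON library your app uses. The real endpoint may also require extra headers or cookies that the browser sends.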

The latter option will allow you to scrape ANY web application, including SPAs, which is not possible using plain Jsoup.
