Reputation: 734
I am working on an app in Android Studio and am having some trouble web-scraping with JSoup. I have successfully connected to the webpage and returned some basic elements to test the library, but now I cannot actually get the elements I need for my app.
I am trying to get a number of elements with the "data-at" attribute. The weird thing is, a few elements with the "data-at" attribute are returned, but not the ones I am looking for. For whatever reason my code is not extracting all of the elements that share the "data-at" attribute on the web page.
This is the URL of the webpage I am scraping: https://express.liatoyotaofcolonie.com/inventory?f=dealer.name%3ALia%20Toyota%20of%20Colonie&f=submodel%3ACamry&f=trim%3ALE&f=year%3A2020
@Override
protected String doInBackground(Void... params) {
String title = "";
Document doc;
Log.d(TAG, queryString.toString());
try {
doc = Jsoup.connect(queryString.toString()).get();
Elements content = doc.select("[data-at]");
for (Element e: content) {
Log.d(TAG, e.text());
}
} catch (IOException e) {
Log.e(TAG, e.toString());
}
return title;
}
Upvotes: 0
Views: 1191
Reputation: 32517
This is because some of the content - including the one you are looking for - is created asyncronously and is not present in initial DOM (Javascript ;))
When you view the source of the page you will notice that there is only 17 data-at
occurences, while running document.querySelector("[data-at]")
29 nodes are returned.
What you are able to get in the JSoup is static content of the page (initial DOM). You wont be able to fetch dynamically created content as you do not run required JS scripts.
In order to overcome this, you will have to either fetch and parse required resources manually (eg trace what AJAX calls are made by the browser) or use headless browser setup. Selenium + headless Chrome should be enough.
Letter option will allow you to scrape ANY posible web application, including SPA apps, which is not possible using plaing Jsoup.
Upvotes: 1
Reputation: 1
I don't quite know what to do about this, but I'm going to try one more time... The "Problematic Lines" in your code are these:
doc = Jsoup.connect(queryString.toString()).get(); Elements content = doc.select("[data-at]");
It is the queryString
that you have requested - the URL
points to a page that contains quite a bit of script code. When you load up a browser and click the button (or menu-option) that reads: "View Source"
, the HTML
you see is not the same exact HTML
that is broadcast to and received by JSoup.
If the HTML
that is broadcast contains any <SCRIPT TYPE="text/javascript"> ... </SCRIPT>
in it (and the named URL
in your question does), AND those <SCRIPT>
tags are involved in the initial loading of the page, then JSoup will not know anything about it... It only parses what it receives, it cannot process any dynamic content.
There are four ways that I know of to get the "Post Script Loaded" version of the HTML
from a dynamic web-page, and I will type them here, now. The first is likely the most popular method (in Java) that I have heard about on Stack Overflow:
post-script processed HTML
. Again, there is no way JSoup can retrieve HTML that is sent to the browser by script (JS/AJAX/Angular/React) since it just a parser."WebView"
(documented here), that I have recently been told about (yesterday ... but it has been out for years) that will execute script in a browser, and return the HTML. Here is an answer that shows "JavaScript Injection" to retrieve DOM Tree elements from a "WebView" instance (which is how I was told it was done)post-script processed HTML
below.To run the Splash API
(only if you have access to the docker
loading program) ... You start a Splash Server
as below. These two lines are typed into a GCP (Google Cloud Platform)
Shell instance, and the server starts right up without any configurations:
Pull the image:
$ sudo docker pull scrapinghub/splash
Start the container:
$ sudo docker run -it -p 8050:8050 --rm scrapinghub/splash
In your code, just prepend the String to your
URL's
:
"http://localhost:8050/render.html?url="
So in your code, you would use the following command (instead), and the script would (more likely) load all the HTML Elements that you are not finding:
String SPLASH_URL = "http://localhost:8050/render.html?url="; doc = Jsoup.connect(SPLASH_URL + queryString.toString()).get();
Upvotes: 0