Ijaz

Reputation: 461

Java scraping website after async scripts are loaded

A little background: I'm trying to give customers an option to add HTML directly and publish a single-page website (like Blogspot). This brought a scammer problem, so I created a microservice that blocks publishing a website based on its HTML content.

Initially I used jsoup to get the HTML from the website. Now the scammer has adapted and loads a script from an external website, asynchronously:

  <script src="https://yolologroyopuedo.us/?api=1&lan=fbcacaroto" type="text/javascript" async="true"></script>

So my initially rendered HTML does not contain any scam content, and it evades the website blocking. I'm trying to scrape the website content after the script has loaded completely, or after some fixed time.

I tried the following, but I always get the HTML as it was before the scam script loaded:

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

I also tried HtmlUnit:

        WebClient webClient = new WebClient();
        webClient.getOptions().setJavaScriptEnabled(false);
        webClient.getOptions().setCssEnabled(false);
        HtmlPage page = webClient.getPage("http://example.com");

Is there an elegant way in Java to scrape a website after all scripts have loaded?

Upvotes: 1

Views: 287

Answers (1)

RBRi

Reputation: 2889

The script you are talking about is executed in your browser. If you want to get the page as it looks after the script has run:

  • You can't use jsoup, because jsoup has no JavaScript support at all and therefore can't process the script.
  • With HtmlUnit you have to enable JavaScript support (your snippet disables it with setJavaScriptEnabled(false)) and then maybe wait for the script to execute (e.g. webClient.waitForBackgroundJavaScript(timeoutMillis)). After that, the DOM tree of the page is updated and you can use the usual selectors to get what you need.
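
The HtmlUnit steps above can be sketched roughly as follows. This is only a minimal sketch, not a definitive implementation: the class name RenderedPageFetcher, the method fetchRenderedHtml, and the 5 second wait are hypothetical choices for illustration, and the imports assume HtmlUnit 3.x or later (older 2.x releases use the com.gargoylesoftware.htmlunit package instead of org.htmlunit).

```java
import java.io.IOException;

import org.htmlunit.BrowserVersion;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

public class RenderedPageFetcher {

    // Hypothetical default: how long to wait for background JavaScript, in milliseconds.
    static final long DEFAULT_WAIT_MILLIS = 5_000L;

    // Loads the page with JavaScript enabled, waits for background scripts,
    // and returns the DOM serialized as XML after that wait.
    static String fetchRenderedHtml(String url, long waitMillis) throws IOException {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // JavaScript must be enabled, otherwise the external script never runs.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            // Assumption: third-party scripts may contain errors; do not abort on them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage(url);

            // Blocks until background JavaScript jobs finish or the timeout expires.
            webClient.waitForBackgroundJavaScript(waitMillis);

            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        String html = fetchRenderedHtml("http://example.com", DEFAULT_WAIT_MILLIS);
        System.out.println(html);
    }
}
```

The returned string could then be passed to the same content check that was applied to the jsoup output. Whether a fixed wait is long enough depends on the script, so the timeout may need tuning.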

If you still have problems, please open an HtmlUnit issue on GitHub and include the URL you are working with, to give us a chance to reproduce your case.

Upvotes: 1
