Reputation: 461
Some background: I'm trying to give customers the option to add HTML directly and publish a single-page website (like Blogspot). This attracted scammers, so I created a microservice that blocks publishing a website based on its HTML content.
Initially I used Jsoup to fetch the HTML from the website. Now the scammer has adapted and loads a script from an external website, asynchronously:
<script src="https://yolologroyopuedo.us/?api=1&lan=fbcacaroto" type="text/javascript" async="true"></script>
So the initially rendered HTML does not contain any scam content, and it evades the website blocking. I'm trying to scrape the website content after the script has loaded completely, or after some fixed time.
Here is what I tried, but I always get the HTML as it was before the scam script loaded:
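To make the evasion concrete: the scam content is missing from the static HTML, but the script tag that pulls it in is still there. The following is a rough sketch, not something from the original post, that lists the hosts of external scripts referenced in a submitted HTML string. The class name `ExternalScriptFinder` and the method `externalScriptHosts` are hypothetical, and the regex is a simplification; a real HTML parser such as Jsoup would be more robust.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds hosts of scripts loaded from other origins.
public class ExternalScriptFinder {

    // Matches the quoted src attribute of a <script ...> tag (rough approximation).
    private static final Pattern SCRIPT_SRC = Pattern.compile(
            "<script\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    static List<String> externalScriptHosts(String html) {
        List<String> hosts = new ArrayList<>();
        Matcher matcher = SCRIPT_SRC.matcher(html);
        while (matcher.find()) {
            try {
                // Relative URLs have no host and are skipped.
                String host = URI.create(matcher.group(2).trim()).getHost();
                if (host != null) {
                    hosts.add(host.toLowerCase());
                }
            } catch (IllegalArgumentException e) {
                // Malformed src value: ignored in this sketch.
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<script src=\"https://yolologroyopuedo.us/?api=1&lan=fbcacaroto\""
                + " type=\"text/javascript\" async=\"true\"></script>"
                + "<script src=\"/js/local.js\"></script>"
                + "</head><body></body></html>";
        System.out.println(externalScriptHosts(html));
    }
}
```

A check like this only shows which third-party hosts a page depends on; it does not replace inspecting the page after its scripts have run.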
Document doc = Jsoup.connect("http://example.com")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();
I also tried HtmlUnit:
WebClient webClient = new WebClient();
webClient.getOptions().setJavaScriptEnabled(false);
webClient.getOptions().setCssEnabled(false);
HtmlPage page = webClient.getPage("http://example.com");
Is there an elegant way to scrape a website in Java after all scripts have loaded?
Upvotes: 1
Views: 287
Reputation: 2889
The script you are talking about is executed in your browser - if you would like to get the page after the script
If you still have problems, please open an HtmlUnit issue on GitHub and include the URL you are working with to give us a chance to reproduce your case.
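For reference, here is a minimal sketch of what fetching the page with JavaScript enabled might look like in HtmlUnit. It assumes the 2.x package name `com.gargoylesoftware.htmlunit` (newer releases use `org.htmlunit`), and the class name `RenderedHtmlFetcher`, the method `fetchRenderedHtml`, and the 10-second wait are illustrative choices, not values taken from the question. Note that the snippet in the question disables JavaScript, so no script can run there.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;

// Hypothetical helper: loads a page, lets scripts run, returns the resulting DOM.
public class RenderedHtmlFetcher {

    static String fetchRenderedHtml(String url, long waitMillis) throws IOException {
        try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
            // JavaScript must be enabled, otherwise external scripts never execute.
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setCssEnabled(false);
            // Untrusted pages often contain broken scripts; do not abort on them.
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

            HtmlPage page = webClient.getPage(url);
            // Give background JavaScript (timers, XHR) up to waitMillis to finish.
            webClient.waitForBackgroundJavaScript(waitMillis);
            // Serialize the DOM as it is after the scripts have run.
            return page.asXml();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(fetchRenderedHtml("http://example.com", 10_000));
    }
}
```

Whether the wait is long enough depends on the page, and HtmlUnit's JavaScript support may not match a real browser for every script, so the returned markup should be treated as a best effort.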
Upvotes: 1