Reputation: 1
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchWalm {

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC"))
                .GET()
                .build();
        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
I'm trying to write a program that sifts through the pages of the Walmart clearance section, searches for a keyword, and tells me which page the keyword was found on.
I'm getting the errors "no javascript" and "your web browser is not running javascript". Do I need to run this through a browser, or is there a Java-only way of doing this?
Upvotes: 0
Views: 249
Reputation: 1422
A headless browser can solve many scraping problems, but unfortunately this website loads its content on demand using JavaScript, so you need an actual browser engine to render the page before you can scrape it.
You can combine Selenium WebDriver and Jsoup to solve this. Selenium WebDriver supports an implicit wait (a timer you set) or a fluent wait; using one of these, it waits until the desired data has finished loading. Once you have the rendered page source, you parse it with Jsoup and extract the result you want.
You also need the Chrome/Firefox browser installed on your machine, along with the corresponding ChromeDriver/FirefoxDriver.
The code below loads the page and prints the titles from the result:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class WebScraperJsOnload {

    public static void main(String[] args) throws IOException {
        String queryString = "https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC";
        WebDriver driver = new ChromeDriver();
        // Wait up to 20 seconds for the JavaScript-loaded content to appear
        driver.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);
        driver.get(queryString);
        // Parse the fully rendered page source with Jsoup
        Document doc = Jsoup.parse(driver.getPageSource());
        driver.quit(); // close the browser once the source is captured
        Elements newsHeadlines = doc.select(".title");
        for (Element headline : newsHeadlines) {
            log("Log: %s", headline.html());
        }
    }

    private static void log(String msg, String... vals) {
        System.out.println(String.format(msg, vals));
    }
}
Maven dependencies for these imports:
<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>
The output of this code looks like:
Log: <h2 class="thumb-header">Clearance Sale DR-E15 Fake Battery DC Coupler Battery Holder Mount Plate Power Supply Set Black</h2>
Log: <h2 class="thumb-header">Clearance Sale 2 in 1 4.7 inch Wireless U Disk Memory Expansion Phone Case for iPhone 6/6S/7 Red</h2>
Log: <h2 class="thumb-header">Clearance Sale USB Charging Power LED Selfie Ring Filling Light With Mobile Phone Clip Holder Black</h2>
Log: <h2 class="thumb-header">Clearance Sale Nillkin Protective Cover Plastic Hard Back Case Protect Mobile Phone Shell Red</h2>
Log: <h2 class="thumb-header">Clearance Sale Children'S Alarm Clock Creative Cute Cartoon Luminous Led Electronic Clock Pink</h2>
...
...
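To get back to your original goal of reporting which page a keyword appears on, the titles collected from each page can be checked with plain string matching. A minimal sketch (the `KeywordFinder` class, sample titles, and keyword here are hypothetical stand-ins for the values you would collect by running the scraper over each page):

```java
import java.util.List;
import java.util.Locale;

public class KeywordFinder {

    // Returns the first 1-based page number whose titles contain the keyword,
    // or -1 if no page matches. Matching is case-insensitive.
    static int findKeyword(List<List<String>> titlesPerPage, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        for (int page = 0; page < titlesPerPage.size(); page++) {
            for (String title : titlesPerPage.get(page)) {
                if (title.toLowerCase(Locale.ROOT).contains(needle)) {
                    return page + 1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical titles, standing in for the scraped ".title" elements of each page
        List<List<String>> pages = List.of(
                List.of("Clearance Sale DR-E15 Fake Battery DC Coupler"),
                List.of("Clearance Sale USB Charging Power LED Selfie Ring Light"));
        System.out.println("Found on page " + findKeyword(pages, "selfie ring"));
    }
}
```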
For the complete project, download this GitHub repo.
Upvotes: 1
Reputation: 1395
It's kind of a Java thing: Java's HTTP client sends a different set of request headers by default than a browser does. I tried that URL and it works fine once you attach an "Accept: */*" header.
You can't do it with your current implementation as-is; add the missing header to your HttpClient request.
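Following that suggestion, the request from the question only needs one extra builder call. A sketch (whether this header alone is enough for the live site is not verified here; the snippet just checks the header is attached, without hitting the network):

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class HeaderDemo {
    public static void main(String[] args) {
        // Same request as in the question, with the suggested Accept header attached
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC"))
                .header("Accept", "*/*")
                .GET()
                .build();
        // Inspect the built request to confirm the header is present
        System.out.println(request.headers().firstValue("Accept").orElse("missing"));
    }
}
```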
Upvotes: 0