Ethan h

Reputation: 1

"no javascript" error when trying to scrape web page

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchWalm {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC"))
                .GET()
                .build();

        // Fetch the page and read the whole response body as a String.
        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}

I'm trying to write a program that sifts through the pages of the Walmart clearance section, searches for a keyword, and tells me which page the keyword was found on.

I'm getting the errors "no javascript" and "your web browser is not running javascript". Do I need to run this through a browser, or is there a Java-only way of doing this?

Upvotes: 0

Views: 249

Answers (2)

Kawser Habib

Reputation: 1422

A headless browser can solve many scraping problems, but unfortunately your website loads its content on demand using JavaScript. To scrape content that is loaded on demand, you need an actual browser.

We can use Jsoup and Selenium WebDriver to solve this problem. Selenium WebDriver supports an implicit wait (you set a timeout) or a fluent wait. With such a wait, we hold off until the desired data has loaded completely. Once the content has arrived, you parse it with Jsoup and extract the result you want.

You also need the Chrome or Firefox browser installed on your machine, along with the matching driver (ChromeDriver or FirefoxDriver).

  • Mac users with Homebrew installed: brew install --cask chromedriver
  • Debian based Linux distros: sudo apt-get install chromium-chromedriver
  • Windows users with Chocolatey installed: choco install chromedriver

Now run the code below, which loads the page and prints the title of each result.

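import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;


public class WebScraperJsOnload {
    public static void main(String[] args) {

        String queryString = "https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC";

        WebDriver driver = new ChromeDriver();
        try {
            driver.get(queryString);

            // An implicit wait only affects findElement calls, not getPageSource(),
            // so wait explicitly (up to 20 seconds) until a title element is present.
            // This is the Selenium 3 constructor; Selenium 4 takes a Duration instead.
            new WebDriverWait(driver, 20)
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(".title")));

            // Hand the rendered HTML to jsoup for parsing.
            Document doc = Jsoup.parse(driver.getPageSource());

            Elements titles = doc.select(".title");
            for (Element title : titles) {
                log("Log: %s", title.html());
            }
        } finally {
            // Always close the browser, even if loading or waiting fails.
            driver.quit();
        }
    }

    private static void log(String msg, String... vals) {
        // Cast so the array is passed as the format arguments without a varargs warning.
        System.out.println(String.format(msg, (Object[]) vals));
    }
}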

Maven dependencies for these imports:

<dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>
<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>3.141.59</version>
</dependency>

The output of this code looks like this:

Log: <h2 class="thumb-header">Clearance Sale DR-E15 Fake Battery DC Coupler Battery Holder Mount Plate Power Supply Set Black</h2>
Log: <h2 class="thumb-header">Clearance Sale 2 in 1 4.7 inch Wireless U Disk Memory Expansion Phone Case for iPhone 6/6S/7 Red</h2>
Log: <h2 class="thumb-header">Clearance Sale USB Charging Power LED Selfie Ring Filling Light With Mobile Phone Clip Holder Black</h2>
Log: <h2 class="thumb-header">Clearance Sale Nillkin Protective Cover Plastic Hard Back Case Protect Mobile Phone Shell Red</h2>
Log: <h2 class="thumb-header">Clearance Sale Children'S Alarm Clock Creative Cute Cartoon Luminous Led Electronic Clock Pink</h2>
...
... 
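
To get from these titles back to what the question asks for (which page a keyword appears on), one option is to collect the titles scraped from each page and scan them. The following is only a minimal sketch, assuming the titles for each page have already been extracted as plain strings. The class name KeywordPageFinder, the method pagesContaining, and the sample data are made up for illustration and are not part of the project linked here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class KeywordPageFinder {

    // Returns the page numbers whose titles contain the keyword (case-insensitive),
    // in the iteration order of the map.
    static List<Integer> pagesContaining(Map<Integer, List<String>> titlesByPage, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        List<Integer> pages = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> entry : titlesByPage.entrySet()) {
            for (String title : entry.getValue()) {
                if (title.toLowerCase(Locale.ROOT).contains(needle)) {
                    pages.add(entry.getKey());
                    break; // one match is enough for this page
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; in practice each list would be filled
        // from the text of the elements returned by doc.select(".title").
        Map<Integer, List<String>> titlesByPage = new LinkedHashMap<>();
        titlesByPage.put(12, List.of("Clearance Sale USB Charging Power LED Selfie Ring"));
        titlesByPage.put(13, List.of("Clearance Sale Nillkin Protective Cover",
                                     "Clearance Sale Alarm Clock"));
        System.out.println(pagesContaining(titlesByPage, "alarm clock")); // prints [13]
    }
}
```

Keeping the search separate from the scraping means it can be checked without opening a browser, and the scraping loop only has to fill the map one page at a time.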

For the complete project, download this GitHub repo.

Upvotes: 1

szatkus

Reputation: 1395

It's kind of a Java thing. Java sends a different set of headers than a browser does when it makes the request. I tried that URL, and it works fine when you attach the "Accept: */*" header.

You can't do it with your current implementation; reimplement it with HttpClient and add the missing header.
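
The header suggestion could look roughly like this with java.net.http (Java 11 or later). This is a minimal sketch, not a tested fix: the class name SearchWalmWithAccept and the helper buildRequest are made up for illustration, and whether the site still serves the listing for such a request is an assumption.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SearchWalmWithAccept {

    // Builds a GET request that explicitly sends "Accept: */*".
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Accept", "*/*")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = buildRequest(
                "https://www.walmart.ca/en/electronics/N-32+103/page-13?sortBy=newest&orderBy=DESC");

        HttpResponse<String> response = client.send(request,
                HttpResponse.BodyHandlers.ofString());

        // Print the status first so a blocked request is easy to spot.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the body still contains the "no javascript" message, the header alone is not enough for that page and a browser-driven approach is the fallback.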

Upvotes: 0
