Jurek Kozyra

Reputation: 93

Writing a simple web crawler that interacts with the browser (Java)

I need to create an automated process (preferably using Java) that will:

  1. Open a browser at a specific URL.
  2. Log in using the specified username and password.
  3. Follow one of the links on the page.
  4. Refresh the browser.
  5. Log out.

The goal is to gather some statistics for analysis. Every time a user follows the link, a bunch of data is generated for that particular user and saved in the database. Using around 10 fake users, I need to ping the page every 5-15 minutes.

Can you think of a simple way of doing that? There has to be an alternative to an endless manual login-refresh-logout process...

Upvotes: 2

Views: 4562

Answers (4)

Syntax

Reputation: 2197

Use HtmlUnit if you want

  1. FAST
  2. SIMPLE

Java-based web interaction/crawling.

For example, here is some simple code that prints the page title and then walks all IMG elements of the loaded page, printing a few details of each.

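// Package names below are for HtmlUnit 2.x; HtmlUnit 3.x moved them to org.htmlunit.
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
import java.net.MalformedURLException;

public class HtmlUnitTest {
  public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    final WebClient webClient = new WebClient();
    // Load the page and print its title.
    final HtmlPage page = webClient.getPage("http://www.google.com");
    System.out.println(page.getTitleText());

    // Walk every element on the page and dump details for each IMG tag.
    for (HtmlElement node : page.getHtmlElementDescendants()) {
      if (node.getTagName().equalsIgnoreCase("IMG")) {
        System.out.println("NAME: " + node.getTagName());
        System.out.println("WIDTH: " + node.getAttribute("width"));
        System.out.println("HEIGHT: " + node.getAttribute("height"));
        System.out.println("TEXT: " + node.asText());
        System.out.println("XML: " + node.asXml());
      }
    }
  }
}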

Example #2: accessing named input fields, entering data, and clicking:

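// Uses the webClient created in the first example.
final HtmlPage page = webClient.getPage("http://www.google.com");

// Look up the search box by its name attribute and type into it.
HtmlElement inputField = page.getElementByName("q");
inputField.type("Example input");

// Clicking the button returns whatever page the click leads to.
HtmlElement btnG = page.getElementByName("btnG");
Page secondPage = btnG.click();

// Print the title of the original page, then the title of the result page.
if (secondPage instanceof HtmlPage) {
  System.out.println(page.getTitleText());
  System.out.println(((HtmlPage) secondPage).getTitleText());
}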

NB: You can call refresh() on an HtmlPage object to reload it.
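
For the "around 10 fake users, every 5-15 min" part of the question, the session code could be driven by a scheduler. Below is a minimal sketch using only the JDK, not a finished implementation. The class name FakeUserScheduler, the user names, and the visit callback are all hypothetical; the actual login, link click, refresh and logout steps (for example with HtmlUnit as above) would go inside the callback.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: runs one "visit" per fake user at random intervals.
public class FakeUserScheduler {
  private final ScheduledExecutorService pool;
  private final Consumer<String> visit; // placeholder for login / follow link / refresh / logout
  private final long minDelayMs;
  private final long maxDelayMs;

  public FakeUserScheduler(int threads, Consumer<String> visit, long minDelayMs, long maxDelayMs) {
    this.pool = Executors.newScheduledThreadPool(threads);
    this.visit = visit;
    this.minDelayMs = minDelayMs;
    this.maxDelayMs = maxDelayMs;
  }

  // Picks a random delay in the inclusive range [minMs, maxMs].
  public static long randomDelayMs(long minMs, long maxMs) {
    return ThreadLocalRandom.current().nextLong(minMs, maxMs + 1);
  }

  // Schedules one visit for the user after a random delay; the task
  // reschedules itself afterwards, so each user keeps visiting until stop().
  public void start(String user) {
    pool.schedule(() -> {
      try {
        visit.accept(user);
      } catch (RuntimeException e) {
        // A failed visit should not end the loop for this user.
        System.err.println("Visit failed for " + user + ": " + e);
      } finally {
        if (!pool.isShutdown()) {
          start(user);
        }
      }
    }, randomDelayMs(minDelayMs, maxDelayMs), TimeUnit.MILLISECONDS);
  }

  // Cancels pending visits and interrupts running ones.
  public void stop() {
    pool.shutdownNow();
  }

  public static void main(String[] args) throws InterruptedException {
    // Demo with short delays so it finishes quickly; for the real run,
    // pass 5 * 60_000L and 15 * 60_000L and do not call stop().
    FakeUserScheduler scheduler = new FakeUserScheduler(
        4, user -> System.out.println("visiting as " + user), 100, 300);
    for (int i = 1; i <= 10; i++) {
      scheduler.start("fakeuser" + i);
    }
    Thread.sleep(1000);
    scheduler.stop();
  }
}
```

Each user gets its own random delay on every round, so the visits do not all hit the server at the same moment. If the visit callback creates one WebClient per user, the sessions stay independent of each other.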

Upvotes: 1

sawu

Reputation: 61

It's not Java, but JavaScript. You could do something like:

window.location = "<url>";
// Note: assigning window.location navigates away, so the lines below
// have to run once the new page has loaded.
document.getElementById("username").value = "<email>";
document.getElementById("password").value = "<password>";

document.getElementById("login_box_button").click();

...

etc

With this kind of structure you can easily cover steps 1-3. Throw in some for loops for the page refreshes and you're done.

Upvotes: 1

Redlab

Reputation: 3118

You could use Apache JMeter (formerly Jakarta JMeter).

Upvotes: 0

Aaron Digulla

Reputation: 328594

Try Selenium.

Upvotes: 5
