user1743740

Reputation: 11

How do you parse links from html using Java?

I'm very much a Java novice. For my class, we have to print out all of the links parsed from HTML source code that the user inputs.

Basically, I want to figure out how to extract the string that follows each href attribute, for every link on the webpage, without using external libraries (i.e. using arrays, substrings, and String methods, but not importing anything else).

Upvotes: 1

Views: 2690

Answers (3)

linski

Reputation: 5094

I don't know what class you are in, so the regular expression solution might be too advanced for you.
That might be the case if you are in your first year, for example, but I can't really tell.

You could do it using substrings or arrays, but that is way too much coding. That's why standard Java regular expressions exist:

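import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkTextDemo { // any class name works
    public static void main(String[] args) {
        String A_TAG_MATCHING_GROUP = "<a>([^<>]*)</a>";
        Matcher matcher = Pattern.compile(A_TAG_MATCHING_GROUP).matcher("<html>\n<head>d\nadas</head><body><a>LINK_DESC_ONE</a>dsdasd<a>LINK_DESC_2</a></body></html>");
        while (matcher.find()) {
            // group(1) holds the text captured by the parentheses
            System.out.println(matcher.group(1));
        }
    }
}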

Compile and run this code, then continue reading!

The crucial part is the A_TAG_MATCHING_GROUP regular expression. As it stands, it matches the exact string "<a>", followed by the captured group ([^<>]*), followed by the exact string "</a>". Inside the group:

  • the star (*) means zero or more characters
  • a character here means any character that is not (the caret, ^) "<" or ">"; the construct in square brackets ([ ]) is called a character class
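
To see how the star and the character class behave, here is a minimal sketch that runs the same pattern against three small inputs. The class name GroupDemo and the helper firstLinkText are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Returns the text captured by group 1 of the first match, or null if nothing matches.
    static String firstLinkText(String html) {
        Matcher m = Pattern.compile("<a>([^<>]*)</a>").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Plain text between the tags is captured.
        System.out.println(firstLinkText("<p><a>home</a></p>"));
        // The star allows zero characters, so the capture is an empty string.
        System.out.println(firstLinkText("<a></a>"));
        // A nested tag starts with '<', which [^<>] excludes, so there is no match.
        System.out.println(firstLinkText("<a><b>bold</b></a>"));
    }
}
```

The third case shows why this pattern only handles links whose content is plain text.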

So, if you write the A_TAG_MATCHING_GROUP regular expression well, with

matcher.group(i);

you'll get the URL. Since it is for your class, I won't write it for you :) Modify the matcher argument and play around a little (change the hardcoded HTML string). Get some real HTML pages and compare your output with that of a real tool like this one.

Of course, you must read the given tutorial first (this might also be useful), and here are the relevant API links:

But if you want to use "arrays and substrings", you could use the following algorithm:

  1. Read the HTML character by character, e.g.

    // html holds the page source
    for (char c : html.toCharArray()) {
        // inspect c here
    }

  2. When you reach a "<", remember it (e.g. in a boolean variable first_char_of_a_tag_found).

  3. Decide whether the "<" must be followed immediately by the "a" character, or whether you will allow line breaks and spaces in between. When you detect the "a", remember it in a boolean variable too.

  4. When you reach the text href=" (including the opening quote), start remembering the contents. You might use [substring()](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#substring(int, int)) on the HTML string there and append its return value to a StringBuilder variable called url.

This is a very low-level algorithm, but it will do the job. It requires a lot of coding, and it is a monolithic, procedural approach.

Loosely speaking, you will be implementing a regular expression "engine" like the one I described in the first part of the post.
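
The steps above can be sketched roughly as follows. This is a shortcut, not the flag-per-character version: it uses String.indexOf to jump to each "<a " and each href=" instead of tracking booleans while walking the characters. It assumes double-quoted href values, a single space after "<a", and no ">" inside attribute values. The names HrefScanner and extractHrefs are hypothetical, and the code imports nothing.

```java
public class HrefScanner {
    // Returns the double-quoted href values of <a ...> tags, one per line.
    // Assumption: tags look like <a ... href="..." ...> with no '>' inside attributes.
    static String extractHrefs(String html) {
        StringBuilder urls = new StringBuilder();
        String marker = "href=\"";
        int pos = 0;
        while ((pos = html.indexOf("<a ", pos)) >= 0) {
            int tagEnd = html.indexOf('>', pos);
            if (tagEnd < 0) {
                break;                                  // unterminated tag; stop scanning
            }
            String tag = html.substring(pos, tagEnd);   // e.g. <a class="x" href="/b"
            int start = tag.indexOf(marker);
            if (start >= 0) {
                start += marker.length();               // first character of the URL
                int end = tag.indexOf('"', start);      // closing quote of the attribute
                if (end >= 0) {
                    urls.append(tag.substring(start, end)).append('\n');
                }
            }
            pos = tagEnd + 1;                           // continue after this tag
        }
        return urls.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> text "
                + "<a class=\"x\" href=\"/b\">B</a><link href=\"/style.css\">";
        System.out.print(extractHrefs(html));
    }
}
```

The link element in the sample input is skipped because only tags starting with "<a " are inspected. Single-quoted or unquoted attributes, uppercase tags, and attributes whose names merely end in href are not handled.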

I programmed both as assignments (the first one in Java for a job interview, and the second one in C as an entrance exam for a Java collegium). Although the usual learning order is the second one first, I'd recommend starting with the first one. It depends, though, on whether you are on a tight schedule and on your current knowledge.

Hope it helps :)

EDIT:

You can't parse HTML with regular expressions, but you can use them to extract URLs from a tags. To avoid any confusion, though: I'd definitely go with Jerry, as Anton suggested.

You can tell that Jerry-like solutions are way better in real life just by comparing the size of his post with mine, and the time needed to digest each, for starters :))
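
As a rough illustration of extracting URLs (not parsing HTML) with a regular expression, here is one possible pattern. It is an assumption for demonstration, not the pattern this answer had in mind: it expects double-quoted href values and no ">" inside attribute values. The names HrefRegex and extractHrefs are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegex {
    // "<a", one whitespace character, any other attributes, then href="...".
    // Group 1 is the URL between the double quotes.
    private static final Pattern HREF =
            Pattern.compile("<a\\s[^>]*?href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the captured URLs, one per line.
    static String extractHrefs(String html) {
        StringBuilder urls = new StringBuilder();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            urls.append(m.group(1)).append('\n');
        }
        return urls.toString();
    }

    public static void main(String[] args) {
        System.out.print(extractHrefs(
                "<a href=\"/one\">1</a><A class=\"c\" HREF=\"/two\">2</A>"));
    }
}
```

A pattern like this breaks on markup it was not written for (single quotes, commented-out links, unusual attributes), which is the practical reason to prefer an HTML-aware library when one is allowed.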

Upvotes: 2

btiernay

Reputation: 8129

You might want to consider some of these ideas

Upvotes: 0

Anton Bessonov

Reputation: 9803

Don't do it with a parser or a regular expression. Try Jerry. Something like this (not tested):

Jerry doc = jerry(html);
doc.$("a").each(new JerryFunction() {
    public boolean onNode(Jerry $this, int index) {
        String href = $this.attr("href");
        System.out.println(href);
        return true; // keep iterating over the remaining links
    }
});

or use any HTML-friendly query language. Since your requirements rule out external libraries, see the question "Trying to parse links in an HTML directory listing using Java".

Upvotes: 5
