Reputation: 1

Java StringTokenizer troubles - Newbie

I know I'm probably being incredibly stupid here, but can anybody shed any light on my problem? I'm trying to extract the title from a string containing html...

 public static void main(String args[]) {
  System.out.println(getTitle("<title>this is it</title>"));
 }

 public static String getTitle(String a) {
  StringTokenizer token = new StringTokenizer(a, "<title>", false);
  return token.nextToken("</title>");
 }

Keeps returning "h" and I can't work out why! Am I being naive?

Cheers

Upvotes: 0

Answers (5)

extraneon

Reputation: 23970

If you are parsing HTML, the best way might be HTML Cleaner, according to this SO post.

I would recommend using this domain specific library, as it will also give you an easy way to extend the functionality of your app when required. Or help you with another app if that's also parsing HTML.

Upvotes: 0

Péter Török

Reputation: 116306

I think your problem lies here (quote from the API doc, text bolded by me):

"The set of delimiters (the characters that separate tokens) may be specified either at creation time or on a per-token basis."

That is, the delimiter is not a string, but a set of characters. When you pass "<title>" as second parameter, you tell your tokenizer that the delimiters are any of the characters <, t, i, l, e or >. The later call nextToken("</title>") replaces that set with the characters of "</title>", which merely adds / to it. Thus the tokenizer dutifully skips all the characters in the first tag and then the t of "this", and returns h because that is not in the set of delimiters you gave it, but the next character (i) is.
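To make the behaviour concrete, here is a minimal sketch that reproduces the call from the question. The class name DelimiterSetDemo and the method name firstToken are made up for illustration; only the StringTokenizer calls come from the question.

```java
import java.util.StringTokenizer;

class DelimiterSetDemo {
    // Same two calls as the question's getTitle. Both string arguments are
    // treated as sets of delimiter characters, not as whole-string delimiters.
    static String firstToken(String input) {
        StringTokenizer tokenizer = new StringTokenizer(input, "<title>", false);
        // nextToken(String) switches the delimiter set to the characters < / t i l e >
        return tokenizer.nextToken("</title>");
    }

    public static void main(String[] args) {
        // Skips "<title>" and the 't' of "this", then stops at 'i'.
        System.out.println(firstToken("<title>this is it</title>")); // prints "h"
    }
}
```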

So StringTokenizer is not quite what you need here. Note also this remark from the API docs:

"StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."

Or use a third party library, as has been noted by others.
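As a rough sketch of the java.util.regex route the API docs mention, something like the following might do for simple input. The class name RegexTitleDemo, the pattern, and the choice to return null when no title is found are my own assumptions, and a regex is still no substitute for a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTitleDemo {
    // Reluctant (.*?) stops at the first closing tag; DOTALL lets the title span lines.
    private static final Pattern TITLE =
            Pattern.compile("<title>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the text between the first <title> and </title>, or null if there is none
    // (returning null is an assumption made for this sketch).
    static String getTitle(String html) {
        Matcher matcher = TITLE.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints "this is it"
    }
}
```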

Upvotes: 2

tonio

Reputation: 10541

You cannot use StringTokenizer this way. See the javadoc http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

The delims argument contains the set of characters that are considered as delimiters in the string. Thus here, you have "<", "t", "i", ... as delimiters.

For that kind of work, you really should consider using a dedicated HTML or XML library. You could also use "<>" as delimiters and implement a minimalist HTML parser suiting your needs, but this will probably lead to bugs, headaches, and more bugs once your minimal needs grow.
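For what it's worth, the "<>" idea could be sketched roughly like this. The class name AngleBracketDemo and the null return value are assumptions for illustration, and the code is deliberately naive.

```java
import java.util.StringTokenizer;

class AngleBracketDemo {
    // With '<' and '>' as delimiters, tag names and text both come back as plain tokens.
    // The token following a "title" token is taken to be the title text.
    static String getTitle(String html) {
        StringTokenizer tokenizer = new StringTokenizer(html, "<>");
        while (tokenizer.hasMoreTokens()) {
            if (tokenizer.nextToken().equalsIgnoreCase("title") && tokenizer.hasMoreTokens()) {
                return tokenizer.nextToken();
            }
        }
        return null; // no title token found (assumed convention for this sketch)
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints "this is it"
    }
}
```

It already shows the kind of bug to expect: for an empty title such as "<title></title>" it returns "/title", because the tokenizer cannot tell tag names from text.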

Upvotes: 0

Nils Schmidt

Reputation: 3720

I am not sure if StringTokenizer is the best class to use in your scenario. Maybe you can solve your task by using String.substring(int, int). As BearsWillEatYou indicated, if you want to do more sophisticated HTML parsing, use some third party library.

public static void main(String args[]) {
    System.out.println(getTitle("<title>this is it</title>"));
}

public static String getTitle(String a) {
    return a.substring(a.indexOf("<title>") + "<title>".length(), a.indexOf("</title>"));
}
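Note that the one-liner above throws a StringIndexOutOfBoundsException when either tag is missing, because indexOf returns -1. A slightly more defensive sketch might look like this; the class name SubstringTitleDemo and returning null for missing tags are assumptions for illustration.

```java
class SubstringTitleDemo {
    // Same indexOf/substring idea, but checks that both tags were actually found.
    static String getTitle(String html) {
        String open = "<title>";
        String close = "</title>";
        int start = html.indexOf(open);
        if (start < 0) {
            return null; // no opening tag
        }
        start += open.length();
        int end = html.indexOf(close, start); // search only after the opening tag
        if (end < 0) {
            return null; // no closing tag
        }
        return html.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints "this is it"
    }
}
```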

Upvotes: 2

tonio

Reputation: 10541

The delimiter you specified is "", which is the empty string. There is an empty string between the "t" and "h" at the start of your string, thus nextToken returns "t". It is normal, and works as specified. See http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

Upvotes: 0
