Reputation: 1
I know I'm probably being incredibly stupid here, but can anybody shed any light on my problem? I'm trying to extract the title from a string containing html...
import java.util.StringTokenizer;

public static void main(String args[]) {
    System.out.println(getTitle("<title>this is it</title>"));
}

public static String getTitle(String a) {
    StringTokenizer token = new StringTokenizer(a, "<title>", false);
    return token.nextToken("</title>");
}
It keeps returning "h" and I can't work out why! Am I being naive?
Cheers
Upvotes: 0
Views: 374
Reputation: 23970
If you are parsing HTML, the best way might be HTML Cleaner, according to this SO post.
I would recommend using this domain-specific library, as it will also give you an easy way to extend the functionality of your app when required, or help you with another app that also parses HTML.
Upvotes: 0
Reputation: 116306
I think your problem lies here (quote from the API doc):
"The set of delimiters (the characters that separate tokens) may be specified either at creation time or on a per-token basis."
That is, the delimiter is not a string, but a set of characters. When you pass "<title>" as the second parameter, you tell your tokenizer that the delimiters are any of the characters <, t, i, t, l, e or >. Thus the tokenizer dutifully skips all the characters in the first tag and then the t of "this", and returns h because that is not in the set of delimiters you gave it, but the next character (i) is.
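To make the character-set behaviour concrete, here is a minimal sketch that repeats the calls from the question and also lists every token produced with "<title>" as the delimiter set. The class name DelimiterSetDemo and the helper allTokens are made up for illustration. Note that nextToken("</title>") replaces the delimiter set with the characters of that string, so / becomes a delimiter as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class DelimiterSetDemo {
    // Same calls as in the question: each character of the delimiter
    // string is treated as a separate delimiter.
    static String firstToken(String input) {
        StringTokenizer token = new StringTokenizer(input, "<title>", false);
        return token.nextToken("</title>");
    }

    // Hypothetical helper: collects every token for a given delimiter set.
    static List<String> allTokens(String input, String delims) {
        List<String> result = new ArrayList<>();
        StringTokenizer token = new StringTokenizer(input, delims, false);
        while (token.hasMoreTokens()) {
            result.add(token.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<title>this is it</title>";
        System.out.println(firstToken(html));           // prints: h
        System.out.println(allTokens(html, "<title>")); // prints: [h, s , s , /]
    }
}
```

The second line of output shows that only the characters outside the set (h, s, the space and /) survive as token content.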
So StringTokenizer is not quite what you need here. Note also this remark from the API docs:
"StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead."
Or use a third-party library, as has been noted by others.
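As a rough illustration of the java.util.regex route, the sketch below pulls out the text between the first <title> and </title> pair. The class name RegexTitle is made up, and the pattern assumes a plain <title> tag with no attributes. It is fine for simple strings like the one in the question, but it is not a substitute for a real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTitle {
    // Reluctant (.*?) stops at the first closing tag; DOTALL lets the
    // title span line breaks; CASE_INSENSITIVE also accepts <TITLE>.
    private static final Pattern TITLE = Pattern.compile(
            "<title>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the title text, or null when no <title>...</title> pair is found.
    static String getTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints: this is it
    }
}
```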
Upvotes: 2
Reputation: 10541
You cannot use StringTokenizer this way. See the javadoc http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html
The delims argument contains the set of characters that are considered as delimiters in the string. Thus here, you have "<", "t", "i", ... as delimiters.
For that kind of work, you really should consider using a dedicated HTML or XML library. You could also use "<>" as delimiters and implement a minimalist HTML parser suiting your needs, but this will probably lead to bugs, headaches, and more bugs once your minimal needs grow.
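For what it's worth, the "<>" idea could look something like the sketch below. AngleBracketTitle is a made-up name, and the code assumes a bare <title> tag with no attributes. It also shows how fragile the approach is: any text node that happens to read "title", or any > inside the text, will confuse it.

```java
import java.util.StringTokenizer;

class AngleBracketTitle {
    // Splits on '<' and '>' so tag names and text become separate tokens,
    // then returns the token that follows the "title" tag name.
    static String getTitle(String html) {
        StringTokenizer tokens = new StringTokenizer(html, "<>");
        while (tokens.hasMoreTokens()) {
            if (tokens.nextToken().equals("title") && tokens.hasMoreTokens()) {
                String next = tokens.nextToken();
                // "<title></title>" yields "title" directly followed by "/title".
                return next.equals("/title") ? "" : next;
            }
        }
        return null; // no <title> tag found
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints: this is it
    }
}
```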
Upvotes: 0
Reputation: 3720
I am not sure if StringTokenizer is the best class to use in your scenario. Maybe you can solve your task by using String.substring(int, int). As BearsWillEatYou indicated, if you want to do more sophisticated HTML parsing, use a third-party library.
public static void main(String args[]) {
    System.out.println(getTitle("<title>this is it</title>"));
}

public static String getTitle(String a) {
    return a.substring(a.indexOf("<title>") + "<title>".length(), a.indexOf("</title>"));
}
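One caveat with the one-liner: when either tag is missing, indexOf returns -1 and substring ends up with bad indices, which either throws or returns the wrong text. A slightly more defensive sketch follows. The class name SubstringTitle and the choice to return null when a tag is missing are my own assumptions.

```java
class SubstringTitle {
    // Returns the text between the first <title> and the next </title>,
    // or null when either tag is missing.
    static String getTitle(String a) {
        final String open = "<title>";
        final String close = "</title>";
        int start = a.indexOf(open);
        if (start < 0) {
            return null;
        }
        start += open.length();
        // Search for the closing tag only after the opening tag.
        int end = a.indexOf(close, start);
        if (end < 0) {
            return null;
        }
        return a.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println(getTitle("<title>this is it</title>")); // prints: this is it
    }
}
```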
Upvotes: 2
Reputation: 10541
The delimiter you specified is "", which is the empty string. There is an empty string between the "t" and "h" at the start of your string, thus nextToken returns "t". It is normal, and works as specified. See http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html
Upvotes: 0