Reputation: 11
This is a StringTokenizer method. I want to print each word on a new line, but it's not working.
StringTokenizer st= new StringTokenizer(s,".|?||!|");
while(st.hasMoreTokens())
{
String t=st.nextToken();//extracting each word of the sentence
for(int i=0;i<st.countTokens();i++)
{
System.out.println(t);
}//printing each word of sentence
}
Sample input:
How are you?
Expected output:
How
Are
You
Upvotes: 0
Views: 70
Reputation: 6890
StringTokenizer is a legacy class kept in Java only for backward compatibility. As the documentation states, its use is discouraged:
StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code.
In your case, your code is not splitting the string s as expected because the constructor was not given the right delimiters (the space character is missing). The second argument is a set of individual delimiter characters, not a regular expression, so the pipe does not mean "or". It is simply one more delimiter character, and it was passed several times.
To properly split the string How are you? into [How, are, you], you just need to pass a delimiter string containing a space and a question mark.
StringTokenizer st = new StringTokenizer(s, " ?");
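To make the difference concrete, here is a minimal sketch comparing the delimiter string from the question with the corrected one. The class name DelimiterDemo and the helper tokenize are made up for illustration; only StringTokenizer itself comes from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class DelimiterDemo {
    // Hypothetical helper: collects every token produced for a delimiter set.
    static List<String> tokenize(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String s = "How are you?";
        // Delimiters from the question: '.', '|', '?', '!' but no space.
        System.out.println(tokenize(s, ".|?||!|"));     // [How are you]
        // Space and question mark as delimiters.
        System.out.println(tokenize(s, " ?"));          // [How, are, you]
        // The pipe is a literal delimiter character, not an "or" operator.
        System.out.println(tokenize("a|b", ".|?||!|")); // [a, b]
    }
}
```

With the original delimiters the whole sentence comes back as a single token, which is why nothing was split on spaces.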
Furthermore, the extra for loop inside your while loop serves no purpose. You are already iterating through the tokens of your string with hasMoreTokens() and extracting them with st.nextToken(). The inner loop also produces wrong output: countTokens() returns the number of tokens still remaining, so each token is printed that many times and the last token is never printed. At this point, you just need to print the extracted token to the console.
A fixed version of your code could look like this:
String s = "How are you?";
StringTokenizer st = new StringTokenizer(s, " ?");
while (st.hasMoreTokens()) {
System.out.println(st.nextToken());
}
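If you are curious why the inner for loop misbehaved, the following sketch records what countTokens() returns right after each nextToken() call. The class name CountTokensDemo and the helper trace are hypothetical names used only for this illustration.

```java
import java.util.StringTokenizer;

public class CountTokensDemo {
    // Hypothetical helper: one line per token, showing how many tokens
    // are still left in the tokenizer after that token was extracted.
    static String trace(String text, String delimiters) {
        StringBuilder sb = new StringBuilder();
        StringTokenizer st = new StringTokenizer(text, delimiters);
        while (st.hasMoreTokens()) {
            String token = st.nextToken();
            sb.append(token)
              .append(" -> remaining: ")
              .append(st.countTokens())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints:
        // How -> remaining: 2
        // are -> remaining: 1
        // you -> remaining: 0
        System.out.print(trace("How are you?", " ?"));
    }
}
```

So a loop bounded by countTokens() would print How twice, are once, and you not at all.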
Lastly, as the documentation suggests, you should prefer the split() method of the String class, or the Pattern and Matcher classes of the java.util.regex package, when you need to split a string.
It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
An implementation with the split method could use the \W+ pattern, where the predefined character class \W matches "a non-word character" (anything that doesn't fall into [a-zA-Z0-9_]), while the greedy quantifier + matches a pattern "one or more times".
String s = "How are you?";
String[] res = s.split("\\W+");
for (String elem : res) {
System.out.println(elem);
}
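One thing to watch out for with split(): if the string starts with a non-word character, the resulting array begins with an empty string. The sketch below shows this and one possible way to filter it out. The class name SplitWordsDemo and the helper words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitWordsDemo {
    // Hypothetical helper: splits on runs of non-word characters and
    // drops the empty string that a leading delimiter would produce.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String part : text.split("\\W+")) {
            if (!part.isEmpty()) {
                result.add(part);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "...How are you?";
        // A leading delimiter yields an empty first element.
        System.out.println(Arrays.toString(s.split("\\W+"))); // [, How, are, you]
        // The helper filters it out.
        System.out.println(words(s));                         // [How, are, you]
    }
}
```

Trailing empty strings are already removed by split() itself, so only the leading one needs attention.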
Alternatively, if you want to experiment with the Pattern and Matcher classes, you need to invert the logic: instead of describing the characters you want to split by, you provide the pattern of the characters you want to match. In your case, the pattern would be \w+, as you want to match "one or more word characters", specifically [a-zA-Z0-9_].
String s = "How are you?";
Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(s);
while(matcher.find()){
System.out.println(matcher.group());
}
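If you would rather collect the matches than print them, a stream-based variant is possible. This is only a sketch and assumes Java 9 or later, where Matcher.results() is available. The class name MatchWordsDemo and the helper words are hypothetical.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MatchWordsDemo {
    // Compiled once and reused; matches one or more word characters.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Hypothetical helper: returns every match of WORD in order.
    // Assumes Java 9+, since Matcher.results() was added in that release.
    static List<String> words(String text) {
        return WORD.matcher(text)
                   .results()
                   .map(MatchResult::group)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("How are you?")); // [How, are, you]
    }
}
```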
Upvotes: 2
Reputation: 9192
Strings can contain all sorts of data besides alphabetic characters, spaces, and punctuation. They can also contain signed or unsigned integer or floating-point values, and even newline sequences such as "\n" or "\r\n". It all depends on what you want to consider a token.
The other answers in this thread already explain what is wrong with your code. I would just like to add that numerical values within a string can also be part of your token list, whether you use the legacy StringTokenizer class or the more common String#split() method.
Here is some example code which can be easily modified to suit your specific needs (read the comments in code):
Using the StringTokenizer class (not recommended):
// Test String:
String s = "This is a test string. It has words and both signed or "
+ "unsigned integer (-378) and floating point [47.86] "
+ "numbers. White-spaces separate the words. These_will_"
+ "also.be.separate.tokens";
/* Words and numbers (signed or unsigned integer or floating point) will
be tokenized:
Prepare the string by replacing punctuation and special characters with
spaces, except the hyphen (-) and a decimal point between digits. Then
collapse any run of whitespace into a single space.
The String#replaceAll() method is used for this because of its flexibility
with regular expressions (regex). The regex below matches a period that is
neither preceded nor followed by a digit, or any character that is not a
letter (in any language), a digit, whitespace, a hyphen, or a period.
The second `replaceAll()` call collapses the whitespace. */
String regex = "(?<!\\d)\\.(?!\\d)|[^\\p{L}\\d\\s-\\.]";
String string = s.replaceAll(regex, " ").replaceAll("\\s+", " ").trim();
/* The delimiter list contains the space character plus the characters of
System.lineSeparator(), to cover OS-dependent newline sequences in the
supplied string, for example "How are \n you?" or "How are \nyou?".
Note that the replaceAll("\\s+", " ") call above has already turned
every newline into a space, so here the extra delimiters are only a
safeguard. */
StringTokenizer st = new StringTokenizer(string, " " + System.lineSeparator());
// List Interface to hold the tokens:
List<String> tokens = new ArrayList<>();
// Iterate through the available tokens and add to List (`tokens`):
while (st.hasMoreTokens()) {
tokens.add(st.nextToken()); // Add determined token to the List:
}
// Display each element (word) within the `tokens` List:
System.out.println("List of Tokens:");
System.out.println("===============");
for (String str : tokens) {
System.out.println(str);
}
When the code above is run, the console will display:
List of Tokens:
===============
This
is
a
test
string
It
has
words
and
both
signed
or
unsigned
integer
-378
and
floating
point
47.86
numbers
White-spaces
separate
the
words
These
will
also
be
separate
tokens
Using the String#split() method (recommended):
// Test String:
String s = "This is a test string. It has words and both signed or "
+ "unsigned integer (-378) and floating point [47.86] "
+ "numbers. White-spaces separate the words. These_will_"
+ "also.be.separate.tokens";
/* Words and numbers (signed or unsigned integer or floating point) will
be tokenized:
Prepare the string by replacing punctuation and special characters with
spaces, except the hyphen (-) and a decimal point between digits. Then
collapse any run of whitespace into a single space.
The String#replaceAll() method is used for this because of its flexibility
with regular expressions (regex). The regex below matches a period that is
neither preceded nor followed by a digit, or any character that is not a
letter (in any language), a digit, whitespace, a hyphen, or a period.
The second `replaceAll()` call collapses the whitespace. */
String regex = "(?<!\\d)\\.(?!\\d)|[^\\p{L}\\d\\s-\\.]";
String string = s.replaceAll(regex, " ").replaceAll("\\s+", " ").trim();
// Split the string into a tokenized String[] array:
String[] tokens = string.split("\\s");
// Display each element (word) within the `tokens` String array:
System.out.println("List of Tokens:");
System.out.println("===============");
for (String str : tokens) {
System.out.println(str);
}
When the code above is run, the console will yet again display:
List of Tokens:
===============
This
is
a
test
string
It
has
words
and
both
signed
or
unsigned
integer
-378
and
floating
point
47.86
numbers
White-spaces
separate
the
words
These
will
also
be
separate
tokens
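As an alternative to cleaning the string first and splitting afterwards, the tokens could also be matched directly with a single regex. This is a different technique from the two listings above, shown only as a rough sketch: the pattern is my own assumption about what counts as a token (an optionally signed integer or decimal number, or a run of letters that may contain inner hyphens), and it differs from the listings above for mixed input such as abc123. The class name TokenMatchDemo and the helper tokens are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenMatchDemo {
    // Assumed token definition (not taken from the listings above):
    //   -?\d+(?:\.\d+)?     an optionally signed integer or decimal number
    //   \p{L}+(?:-\p{L}+)*  letters, optionally joined by single hyphens
    private static final Pattern TOKEN =
            Pattern.compile("-?\\d+(?:\\.\\d+)?|\\p{L}+(?:-\\p{L}+)*");

    // Hypothetical helper: returns every match of TOKEN in order.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "This is a test string. It has words and both signed or "
                + "unsigned integer (-378) and floating point [47.86] "
                + "numbers. White-spaces separate the words. These_will_"
                + "also.be.separate.tokens";
        for (String token : tokens(s)) {
            System.out.println(token);
        }
    }
}
```

For the test string used in this answer, this should produce the same tokens as the two listings above, including -378, 47.86, and White-spaces.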
Upvotes: 0
Reputation: 1
Each character in the delim argument is a separate delimiter for splitting tokens, so the space must be included. You could use:
StringTokenizer st = new StringTokenizer(s, " .?!");
Upvotes: -2