Reputation: 17
Scanner scan = new Scanner(System.in);
String s = scan.nextLine();
Queue q=new LinkedList();
for(int i=0;i<s.length();i++){
int x=(int)s.charAt(i);
if(x<65 || (x>90 && x<97) || x>122) {
q.add(s.charAt(i));
}
}
System.out.println(q.peek());
String redex="";
while(!q.isEmpty()) {
redex+=q.remove();
}
String[] x=s.split(redex,-1);
for(String y:x) {
if(y!=null)
System.out.println(y);
}
scan.close();
I am trying to print the words of the string "my name is NLP and I, so, works:fine;"yes"." without symbols such as {[]}+-_)*&%$, but it just prints the whole string as it is. What is the problem?
Upvotes: 1
Views: 85
Reputation: 11969
This is three answers in one:
First
When you build a regex from arbitrary characters, you should quote it:
String[] x=s.split(Pattern.quote(redex),-1);
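That is the usual problem. The second problem is that you are building a regex character class but omitting the surrounding [], so the split looks for the whole run of collected characters in a row instead of any single one of them. With the brackets added it can work: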
String[] x=s.split("[" + Pattern.quote(redex) + "]",-1);
This one may work, but it may fail if Pattern.quote does not quote - and a - ends up between two characters, forming a range such as $-!. That would mean: any character in the range from $ to !. It fails if the range is invalid, and my example is in fact invalid, because $ comes after ! in character order.
Finally, you may use:
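// assumes q is declared as Queue<Character>; Pattern.quote takes a String
String redex = q.stream()
    .map(c -> Pattern.quote(String.valueOf(c)))
    .collect(Collectors.joining("|"));
This regex should match any one of the unwanted characters.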
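Putting the pieces of this first part together, a minimal sketch might look like the following. The class name SplitOnCollected and the helper words are made up for illustration, and the sketch assumes that "unwanted" simply means "not a letter". The (?: ... )+ wrapper around the alternation is an addition to the snippet above, so that runs of delimiters such as ", " do not produce empty strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class SplitOnCollected {
    // Hypothetical helper: collects every non-letter of s, then splits s on them.
    static List<String> words(String s) {
        // A set avoids repeating the same character in the regex.
        Set<Character> unwanted = new LinkedHashSet<>();
        for (char c : s.toCharArray()) {
            if (!Character.isLetter(c)) {
                unwanted.add(c);
            }
        }
        List<String> result = new ArrayList<>();
        if (unwanted.isEmpty()) {
            // Nothing to split on: the whole string is one word (if non-empty).
            if (!s.isEmpty()) {
                result.add(s);
            }
            return result;
        }
        // Quote each character individually and join them as alternatives.
        String regex = unwanted.stream()
                .map(c -> Pattern.quote(String.valueOf(c)))
                .collect(Collectors.joining("|", "(?:", ")+"));
        for (String token : s.split(regex)) {
            // A leading delimiter still yields one empty first token; skip it.
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String w : words("my name is NLP and I, so, works:fine;\"yes\".")) {
            System.out.println(w);
        }
    }
}
```

For the string from the question this prints my, name, is, NLP, and, I, so, works, fine and yes, one per line.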
Second
For the rest, the other answer points out another problem: you are not using the Character.isXXX methods to check for valid characters.
Firstly, be aware that some of these methods take code points rather than char. For example, isAlphabetic takes a code point. A code point is the numeric value of a Unicode character, whereas a Java char is a single UTF-16 code unit, and some Unicode characters take two char (a surrogate pair).
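To make the char versus code point difference concrete, here is a small sketch. The class and method names are invented for illustration, and U+10400 (DESERET CAPITAL LETTER LONG I) is just one convenient example of a letter that does not fit in a single char.

```java
class CodePointDemo {
    // Counts letters by looking at each char (UTF-16 code unit) separately.
    static int lettersByChar(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Counts letters by iterating over whole code points.
    static long lettersByCodePoint(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        // U+10400 lies outside the Basic Multilingual Plane,
        // so it is stored as two chars (a surrogate pair).
        String s = "a" + new String(Character.toChars(0x10400));
        System.out.println(s.length());                       // number of chars
        System.out.println(s.codePointCount(0, s.length()));  // number of code points
        System.out.println(lettersByChar(s));
        System.out.println(lettersByCodePoint(s));
    }
}
```

The char-based count misses the second letter because each half of a surrogate pair, taken alone, is not a letter.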
Secondly, I think your problem lies in the fact that you are not using the right tool to split your words.
In pseudo code, this should be:
List<String> words = new ArrayList<>();
int offset = 0; // start of the current word
for (int i = 0, n = line.length(); i < n; ++i) {
    // a non-matching character marks a switch from word to non-word
    if (!Character.isLetterOrDigit(line.charAt(i))) {
        if (offset != i) {
            words.add(line.substring(offset, i));
        }
        offset = i + 1; // next char
    }
}
// add the last word if the line does not end with a delimiter
if (offset != line.length()) {
    words.add(line.substring(offset));
}
This would:
- Find each transition from word to non-word and move offset (where the next word starts)
- Add the word to the list
- Add the last token as the ending word.
Last
Alternatively, you may also use the Scanner class, since it lets you set a custom delimiter for its hasNext() and next(): https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html
I quote the class javadoc:
The scanner can also use delimiters other than whitespace. This example reads several items in from a string:
String input = "1 fish 2 fish red fish blue fish";
Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
System.out.println(s.nextInt());
System.out.println(s.nextInt());
System.out.println(s.next());
System.out.println(s.next());
s.close();
As you guessed, you may pass any delimiter and then use hasNext() and next() to get only valid words.
For example, using [^a-zA-Z0-9] would split on every character that is not a letter or a digit.
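As a rough sketch of that idea (the class name ScannerWords and the helper words are hypothetical), the delimiter can be handed to useDelimiter. The trailing + is an addition here: the Scanner javadoc notes that a pattern matching only one delimiter character at a time can return empty tokens, so matching a whole run of delimiters avoids them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerWords {
    // Hypothetical helper: returns the runs of ASCII letters and digits in input.
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        // Any run of characters other than a-z, A-Z, 0-9 acts as one delimiter.
        try (Scanner sc = new Scanner(input).useDelimiter("[^a-zA-Z0-9]+")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String w : words("my name is NLP and I, so, works:fine;\"yes\".")) {
            System.out.println(w);
        }
    }
}
```

Note that this pattern is ASCII-only; accented or non-Latin letters would be treated as delimiters.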
Upvotes: 1
Reputation: 311978
As noted in the comment, the condition x<65 will catch all sorts of special characters you're not interested in. Using Character's built-in methods will help you write this condition in a clearer, bug-free way:
char c = s.charAt(i);
// note the spelling: isWhitespace, not isWhiteSpace
if (Character.isLetter(c) || Character.isWhitespace(c)) {
    q.add(c);
}
Upvotes: 0