Reputation: 17

Why can't I get the string without tokens with the program I have written?

 Scanner scan = new Scanner(System.in);
 String s = scan.nextLine();
 Queue q=new LinkedList();
 for(int i=0;i<s.length();i++){
     int x=(int)s.charAt(i);
     if(x<65 || (x>90 && x<97) || x>122) {
         q.add(s.charAt(i));
     }
 }
 System.out.println(q.peek());
 String redex="";
 while(!q.isEmpty()) {
     redex+=q.remove();
 }
 String[] x=s.split(redex,-1);
 for(String y:x) {
     if(y!=null)
         System.out.println(y);
 }

 scan.close();

I am trying to print the string "my name is NLP and I, so, works:fine;"yes"." without tokens such as {[]}+-_)*&%$, but it just prints the whole string as it is, and I don't understand the problem.

Upvotes: 1

Answers (2)

NoDataFound

Reputation: 11969

This is 3 answers in one:

For your initial problem
For a solution without regex
For a correct use of Scanner (this is up to you).

First

When you build a regex from arbitrary characters, you should quote it:

String[] x=s.split(Pattern.quote(redex),-1);

That would be the usual problem, but the second problem is that you are building a regexp character class while omitting the [] that delimit it, so it can't work as is:

String[] x=s.split("[" + Pattern.quote(redex) + "]",-1);

This one may work, but it may fail if Pattern.quote doesn't quote - and a - sits between two characters, forming a range such as $-!.

That would mean: the characters in the range from $ to !. It fails if the range is invalid, and this example is in fact invalid, since $ (code 36) comes after ! (code 33).
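To sidestep the range question entirely, one option is to escape each collected character yourself before wrapping them in []. The following is a minimal sketch under that assumption; the class and method names (CharClassSplit, toCharClass, split) are hypothetical and not part of the question's code, and it assumes the delimiter string is non-empty.

```java
import java.util.ArrayList;
import java.util.List;

final class CharClassSplit {
    // Builds "[...]" from the delimiters. Every non-alphanumeric character is
    // backslash-escaped, so '-', ']' and '^' are literal and cannot form a range.
    // Assumes delimiters is non-empty ("[]" alone is not a valid pattern).
    static String toCharClass(String delimiters) {
        StringBuilder sb = new StringBuilder("[");
        for (char c : delimiters.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.append(']').toString();
    }

    // Splits s on any of the delimiters and drops the empty strings that
    // appear between two adjacent delimiters.
    static List<String> split(String s, String delimiters) {
        List<String> out = new ArrayList<>();
        for (String part : s.split(toCharClass(delimiters))) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("so, works:fine;", ",: ;")); // [so, works, fine]
        System.out.println(split("a$b-c!d", "$-!"));          // [a, b, c, d]
    }
}
```

Only non-alphanumeric characters are escaped because a backslash in front of a letter or digit can change its meaning in a Java regex.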

Finally, you may use:

// Assumes q is declared as Queue<Character>; Pattern.quote needs a String.
String redex = q.stream()
                .map(c -> Pattern.quote(String.valueOf(c)))
                .collect(Collectors.joining("|"));

This regexp should match any one of the unwanted characters.
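As a rough end-to-end illustration of the alternation approach, here is a sketch that collects the delimiters, joins them with |, and splits the sentence from the question. The class name AlternationSplit is hypothetical, and using a Set with Character.isLetter (instead of the question's queue and ASCII ranges) is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class AlternationSplit {
    static List<String> words(String s) {
        // Collect each distinct non-letter character, in first-seen order.
        Set<Character> delimiters = new LinkedHashSet<>();
        for (char c : s.toCharArray()) {
            if (!Character.isLetter(c)) {
                delimiters.add(c);
            }
        }
        List<String> out = new ArrayList<>();
        if (delimiters.isEmpty()) {
            out.add(s); // nothing to split on
            return out;
        }
        // Quote every delimiter and join them as alternatives: \Q,\E|\Q:\E|...
        String regex = delimiters.stream()
                .map(c -> Pattern.quote(String.valueOf(c)))
                .collect(Collectors.joining("|"));
        for (String part : s.split(regex)) {
            if (!part.isEmpty()) { // adjacent delimiters produce empty strings
                out.add(part);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("my name is NLP and I, so, works:fine;\"yes\"."));
        // [my, name, is, NLP, and, I, so, works, fine, yes]
    }
}
```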

Second

For the rest, the other answer points out another problem: you are not using the Character.isXXX methods to check for valid characters.

Firstly, be aware that some of these methods take code points rather than char. For example, isAlphabetic takes a code point (an int). A code point is simply the numeric value of a Unicode character. Java strings are UTF-16, so some Unicode characters take two char values (a surrogate pair).
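The difference between char and code point can be seen with a character outside the Basic Multilingual Plane. This is a small sketch (the class name CodePointDemo is hypothetical) that uses String.codePoints() so that such a character is tested once, as a whole.

```java
final class CodePointDemo {
    // Counts letters by code point, so a surrogate pair is tested as one character.
    static long countLetters(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    public static void main(String[] args) {
        // 'a' followed by U+1D400 (MATHEMATICAL BOLD CAPITAL A), stored as two chars.
        String s = "a\uD835\uDC00";
        System.out.println(s.length());                      // 3 char values
        System.out.println(s.codePointCount(0, s.length())); // 2 code points
        System.out.println(Character.isLetter(s.charAt(1))); // false: a lone surrogate
        System.out.println(countLetters(s));                 // 2
    }
}
```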

Secondly, I think your problem lies in the fact that you are not using the right tool to split your words.

In pseudocode, this would be:

List<String> words = new ArrayList<>();
int offset = 0; // index where the current word started
for (int i = 0, n = line.length(); i < n; ++i) {
  // if the character fails to match, we switched from word to non-word
  if (!Character.isLetterOrDigit(line.charAt(i))) {
    if (offset != i) {
      words.add(line.substring(offset, i)); // skip empty words
    }
    offset = i + 1; // next char
  }
}
// add the last word if the line does not end with a non-word character
if (offset != line.length()) {
  words.add(line.substring(offset));
}

This would:

Find each transition from word to non-word and move the offset (where the current word started)
Add the word to the list
Add the last token as the final word.
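To try the loop on the sentence from the question, it can be wrapped in a small class. This is only a sketch: the class name WordSplitter and the method name words are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class WordSplitter {
    static List<String> words(String line) {
        List<String> words = new ArrayList<>();
        int offset = 0; // index where the current word started
        for (int i = 0, n = line.length(); i < n; ++i) {
            if (!Character.isLetterOrDigit(line.charAt(i))) {
                if (offset != i) {
                    words.add(line.substring(offset, i));
                }
                offset = i + 1; // the next word can start after this character
            }
        }
        if (offset != line.length()) {
            words.add(line.substring(offset)); // trailing word, if any
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(words("my name is NLP and I, so, works:fine;\"yes\"."));
        // [my, name, is, NLP, and, I, so, works, fine, yes]
    }
}
```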

Last

Alternatively, you can also use the Scanner class, since it lets you set a custom delimiter for its hasNext() and next(): https://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html

I quote the class javadoc:

The scanner can also use delimiters other than whitespace. This example reads several items in from a string:

     String input = "1 fish 2 fish red fish blue fish";
     Scanner s = new Scanner(input).useDelimiter("\\s*fish\\s*");
     System.out.println(s.nextInt());
     System.out.println(s.nextInt());
     System.out.println(s.next());
     System.out.println(s.next());
     s.close();

As you may have guessed, you can pass any delimiter and then use hasNext() and next() to get only the valid words.

For example, using [^a-zA-Z0-9] would split on each non-alphanumeric character. Add a + ([^a-zA-Z0-9]+) to treat a run of them as a single delimiter and avoid empty tokens.
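Applied to the sentence from the question, the Scanner approach might look like the sketch below. The class name ScannerWords is hypothetical, and the delimiter uses a trailing + so that consecutive punctuation and spaces count as one separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class ScannerWords {
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        // Any run of characters that are not ASCII letters or digits is a delimiter.
        try (Scanner sc = new Scanner(input).useDelimiter("[^a-zA-Z0-9]+")) {
            while (sc.hasNext()) {
                out.add(sc.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("my name is NLP and I, so, works:fine;\"yes\"."));
        // [my, name, is, NLP, and, I, so, works, fine, yes]
    }
}
```

Unlike Character.isLetterOrDigit, this pattern only recognizes ASCII letters and digits, so accented letters would be treated as delimiters.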

Upvotes: 1

Mureinik

Reputation: 311978

As noted in the comment, the condition x<65 will catch all sorts of special characters you're not interested in. Using Character's built-in methods will help you write this condition in a clearer, bug-free way:

char x = s.charAt(i);
// keep only letters and whitespace; note the method is isWhitespace, not isWhiteSpace
if (Character.isLetter(x) || Character.isWhitespace(x)) {
    q.add(x);
}
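One possible reading of this condition is that it keeps the wanted characters instead of collecting the unwanted ones, so the cleaned string can be built directly with no regex at all. A minimal sketch under that assumption follows; the class name KeepLettersAndSpaces and the method name strip are hypothetical.

```java
final class KeepLettersAndSpaces {
    // Returns s with every character that is neither a letter nor whitespace removed.
    static String strip(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char x = s.charAt(i);
            if (Character.isLetter(x) || Character.isWhitespace(x)) {
                sb.append(x);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("I, so, works:fine;")); // I so worksfine
    }
}
```

Note that removing punctuation this way glues together words that were separated only by punctuation, as in "works:fine".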

Upvotes: 0
