flatronka

Reputation: 1081

Java Scanner hasNext(String) method sometimes does not match

I was trying to use the Java Scanner hasNext method, but I got strange results. Maybe my problem is very obvious, but why does the simple expression "[a-zA-Z']+" not work for words like these: "points. anything, supervisor,"? I have tried "[\\w']+" too.

public HashMap<String, Integer> getDocumentWordStructureFromPath(File file) {
    HashMap<String, Integer> dictionary = new HashMap<>();
    try {
        Scanner lineScanner = new Scanner(file);
        while (lineScanner.hasNextLine()) {
            Scanner scanner = new Scanner(lineScanner.nextLine());
            while (scanner.hasNext("[\\w']+")) {
                String word = scanner.next().toLowerCase();
                if (word.length() > 2) {
                    int count = dictionary.getOrDefault(word, 0) + 1;
                    dictionary.put(word, count);
                }
            }
            scanner.close();
        }
        //scanner.useDelimiter(DELIMITER);
        lineScanner.close();

        return dictionary;

    } catch (FileNotFoundException e) { 
        e.printStackTrace();
        return null;
    }   
}

Upvotes: 0

Views: 7458

Answers (1)

Jose-Rdz

Reputation: 539

Your regular expression should be used as a delimiter, like this: [^a-zA-Z]+, because you need to split on everything that is not a letter (add ' inside the brackets if apostrophes should stay part of a word):

// previous code...
// Treat every run of non-letter characters as a separator.
Scanner scanner = new Scanner(lineScanner.nextLine()).useDelimiter("[^a-zA-Z]+");
while (scanner.hasNext()) {
    String word = scanner.next().toLowerCase();
    // ...your other code
}
// ... after code
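
To make the delimiter approach concrete, here is a minimal, self-contained sketch of the word count. The class name WordCounter and the method countWords are hypothetical names chosen for illustration, and keeping the apostrophe out of the delimiter is an assumption based on the patterns in the question:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;

// Hypothetical helper, not part of the original code.
class WordCounter {
    // Counts words longer than two characters. Every run of characters that
    // is neither a letter nor an apostrophe is treated as a separator
    // (keeping the apostrophe is an assumption taken from the question).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> dictionary = new HashMap<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("[^a-zA-Z']+")) {
            while (scanner.hasNext()) {
                String word = scanner.next().toLowerCase();
                if (word.length() > 2) {
                    // Insert 1 for a new word, otherwise add 1 to the old count.
                    dictionary.merge(word, 1, Integer::sum);
                }
            }
        }
        return dictionary;
    }
}
```

With this sketch, the words from the question ("points. anything, supervisor,") are counted without the trailing punctuation, because the punctuation is consumed as part of the delimiter instead of being part of the token.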

EDIT: Why doesn't it work with the hasNext(String) method?

This line:

Scanner scanner = new Scanner(lineScanner.nextLine());

creates a scanner that splits its input on the default whitespace delimiter, so if you have, for example, the test line "Hello World. A test, ok." it will deliver these tokens:

  • Hello
  • World.
  • A
  • test,
  • ok.

Then, if you use scanner.hasNext("[a-zA-Z]+"), you are asking the scanner whether the next token, as a whole, matches your pattern. For this example it returns true for the first token:

  • Hello (the entire token consists of letters, so it matches the pattern you specified)

The next token (World.) contains a period, so it does not match the pattern; scanner.hasNext("[a-zA-Z]+") returns false and your loop ends. The pattern is never used to split the input, so the loop stops at the first token that contains any character that is not a letter.
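
The behaviour can be reproduced with a small sketch that mirrors the loop from the question. The class HasNextDemo and the method tokensWhileMatching are hypothetical names used only for this illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of the original code.
class HasNextDemo {
    // Collects tokens for as long as hasNext(pattern) accepts the next
    // whitespace-delimited token, then stops, like the loop in the question.
    static List<String> tokensWhileMatching(String line, String pattern) {
        List<String> matched = new ArrayList<>();
        try (Scanner scanner = new Scanner(line)) {
            while (scanner.hasNext(pattern)) {
                matched.add(scanner.next());
            }
        }
        return matched;
    }
}
```

Calling tokensWhileMatching("Hello World. A test, ok.", "[a-zA-Z]+") returns a list containing only Hello: the loop ends at World. even though later tokens such as A would match on their own.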

Hope this helps.

Upvotes: 1
