David Soroko

Reputation: 9086

Use Scanner to tokenize a file

I need to tokenize a text file where tokens are defined by "[a-zA-Z]+". The following works:

Pattern WORD = Pattern.compile("[a-zA-Z]+");

File f = new File(...);
FileInputStream inputStream = new FileInputStream(f);
Scanner scanner = new Scanner(inputStream);

String word = null;

while( (word = scanner.findWithinHorizon(WORD, (int)f.length() )) != null ) {
    // process the word
}

The problem is that findWithinHorizon takes the horizon as an int, while File.length() returns a long, so the cast is not safe for files larger than Integer.MAX_VALUE bytes.

What is a sensible way to tokenize a large file using a Scanner?
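
To make the concern concrete, here is a small sketch of what the (int) cast in the loop above does to a length that does not fit in an int. The class name HorizonCast, the helper toHorizon, and the 3,000,000,000-byte size are made up for illustration; no real file is involved.

```java
public class HorizonCast {
    // Mirrors the (int) f.length() cast used as the horizon in the loop above.
    static int toHorizon(long fileLength) {
        return (int) fileLength; // narrowing conversion: keeps only the low 32 bits
    }

    public static void main(String[] args) {
        long small = 1_000L;
        long large = 3_000_000_000L; // hypothetical size above Integer.MAX_VALUE

        System.out.println(toHorizon(small)); // 1000
        System.out.println(toHorizon(large)); // -1294967296
    }
}
```

A negative horizon makes findWithinHorizon throw IllegalArgumentException, and other large lengths can wrap to a small positive value that would silently cut the search short.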

Upvotes: 1

Views: 488

Answers (1)

Jean Logeart

Reputation: 53819

Use a delimiter that is the negation of the matching pattern:

Scanner s = new Scanner(f).useDelimiter("[^a-zA-Z]+");
while(s.hasNext()) {
    String token = s.next();
    // do something with "token"
}
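
A self-contained sketch of the same idea, in case it helps to try it without a file. It reads from a StringReader instead of the File f above, and the names WordTokenizer and tokenize are only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

public class WordTokenizer {
    // Any run of characters that are not ASCII letters separates two tokens.
    private static final Pattern NON_LETTERS = Pattern.compile("[^a-zA-Z]+");

    // Collects the tokens into a list; for a large input you would
    // process each token inside the loop instead of storing them all.
    static List<String> tokenize(Readable source) {
        List<String> words = new ArrayList<>();
        try (Scanner scanner = new Scanner(source).useDelimiter(NON_LETTERS)) {
            while (scanner.hasNext()) {
                words.add(scanner.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Digits and punctuation fall into the delimiter and are skipped.
        System.out.println(tokenize(new StringReader("Hello, world! 42 times")));
        // prints [Hello, world, times]
    }
}
```

With this approach no horizon is passed at all, so the file length and the int/long mismatch never come into play.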

Upvotes: 3
