Reputation: 9086
I need to tokenize a text file where tokens are defined by "[a-zA-Z]+". The following works:
Pattern WORD = Pattern.compile("[a-zA-Z]+");
File f = new File(...);
FileInputStream inputStream = new FileInputStream(f);
Scanner scanner = new Scanner(inputStream);
String word = null;
while( (word = scanner.findWithinHorizon(WORD, (int)f.length() )) != null ) {
// process the word
}
The problem is that findWithinHorizon requires an int as the horizon, while the file length is of type long.
What is a sensible way to tokenize a large file using a Scanner?
Upvotes: 1
Views: 488
Reputation: 53819
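Use a delimiter that is the negation of the matching pattern. The Scanner then splits the input as it reads, so no horizon (and no cast of the file length) is needed: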
Scanner s = new Scanner(f).useDelimiter("[^a-zA-Z]+");
while(s.hasNext()) {
String token = s.next();
// do something with "token"
}
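For reference, here is a minimal self-contained sketch of the same idea. The class name WordTokenizer and the helper tokenize are made up for illustration, and it reads from a StringReader so it runs without a file on disk. For a real file you would presumably pass new Scanner(f) instead, which throws FileNotFoundException if the file is missing.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordTokenizer {

    // Hypothetical helper (not part of the original post): returns every
    // maximal run of ASCII letters found in the source, in order.
    static List<String> tokenize(Readable source) {
        List<String> words = new ArrayList<>();
        // Anything that is NOT a letter acts as a delimiter, so each token
        // the Scanner returns matches [a-zA-Z]+.
        try (Scanner s = new Scanner(source).useDelimiter("[^a-zA-Z]+")) {
            while (s.hasNext()) {
                words.add(s.next());
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in input; a file-backed Scanner would be used the same way.
        System.out.println(tokenize(new StringReader("Hello, world! 42 foo_bar")));
        // prints [Hello, world, foo, bar]
    }
}
```

Because the Scanner pulls input as it goes, the loop never needs the file length, which is what sidesteps the int/long mismatch from the question.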
Upvotes: 3