Reputation: 161
I'd like to read the "text8" corpus in Java and reformat some words. The problem is that in this 100 MB corpus all words are on a single line. If I try to load it with BufferedReader and readLine, the whole file is pulled into memory at once, which uses too much space, and the program can't get as far as splitting the words into a list or array.
So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all words are on one line, could I read 100 words per iteration?
Upvotes: 6
Views: 15774
Reputation: 1
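import java.io.FileInputStream;
import java.util.Scanner;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.util.ZipSecureFile;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

try (FileInputStream fis = new FileInputStream("Example.docx")) {
    // Lower the inflate-ratio limit so highly compressed documents still open
    ZipSecureFile.setMinInflateRatio(0.009);
    XWPFDocument file = new XWPFDocument(OPCPackage.open(fis));
    XWPFWordExtractor ext = new XWPFWordExtractor(file);
    Scanner scanner = new Scanner(ext.getText());
    while (scanner.hasNextLine()) {
        // Split each line of the extracted text on single spaces
        String[] value = scanner.nextLine().split(" ");
        for (String v : value) {
            System.out.println(v);
        }
    }
} catch (Exception e) {
    System.out.println(e);
}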
Upvotes: -1
Reputation: 1892
Use the next method of java.util.Scanner.
The next method finds and returns the next complete token from this scanner. A complete token is preceded and followed by input that matches the delimiter pattern. This method may block while waiting for input to scan, even if a previous invocation of Scanner.hasNext returned true.
Example:
public static void main(String[] args) {
    Scanner sc = new Scanner(System.in);
    String a = sc.next();
    String b = sc.next();
    System.out.println("First Word: " + a);
    System.out.println("Second Word: " + b);
    sc.close();
}
Input :
Hello Stackoverflow
Output :
First Word: Hello
Second Word: Stackoverflow
In your case, use a Scanner to read the file, then call next() on the scanner object to read each token (word).
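As a rough sketch of how this could look for the "100 words per iteration" case in the question: the class name WordBatches, the helper nextBatch, and the sample sentence are made up for illustration, and the path to the corpus is assumed to be passed as the first command-line argument. Since the default delimiter is whitespace, Scanner should only need to buffer enough input to find the next token, not the whole line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class WordBatches {

    // Hypothetical helper: collects up to batchSize whitespace-separated tokens.
    // Returns an empty list once the input is exhausted.
    static List<String> nextBatch(Scanner sc, int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        while (batch.size() < batchSize && sc.hasNext()) {
            batch.add(sc.next());
        }
        return batch;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to the corpus (e.g. text8).
        // Without an argument, a short in-memory sample is used instead.
        Reader source = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new StringReader("all of these words sit on one single line");

        // Closing the Scanner also closes the underlying reader.
        try (Scanner sc = new Scanner(source)) {
            List<String> batch;
            while (!(batch = nextBatch(sc, 100)).isEmpty()) {
                // Reformat or store the words here; this sketch only counts them.
                System.out.println("Read " + batch.size() + " words");
            }
        }
    }
}
```

Each batch can be processed and discarded before the next one is read, so memory use stays roughly proportional to the batch size rather than to the file size.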
Upvotes: 1
Reputation: 5423
You can try using Scanner and set the delimiter to whatever suits you:
// myFile is a java.io.File; this constructor throws FileNotFoundException
Scanner input = new Scanner(myFile);
input.useDelimiter(" +"); // delimiter is one or more spaces
while (input.hasNext()) {
    System.out.println(input.next());
}
Upvotes: 6
Reputation: 524
I would suggest using a character stream with FileReader.
Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm
import java.io.*;

public class CopyFile {
    public static void main(String args[]) throws IOException {
        FileReader in = null;
        FileWriter out = null;
        try {
            in = new FileReader("input.txt");
            out = new FileWriter("output.txt");
            int c;
            // read() returns one character at a time, or -1 at end of file
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } finally {
            if (in != null) {
                in.close();
            }
            if (out != null) {
                out.close();
            }
        }
    }
}
It reads one 16-bit Unicode character at a time, so it doesn't matter that your text is all on one line.
Since you want to go word by word, you can simply read until you hit a space, and there's your word.
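A minimal sketch of the "read until you hit a space" idea follows. The class name CharWordReader, the helper nextWord, and the sample sentence are hypothetical, and the file path is assumed to come from the first command-line argument. It uses a buffered reader so that the single-character reads are served from memory rather than hitting the disk each time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CharWordReader {

    // Hypothetical helper: returns the next whitespace-delimited word,
    // or null when the end of the input has been reached.
    static String nextWord(Reader in) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    return word.toString(); // whitespace ends the current word
                }
                // otherwise: leading or repeated whitespace, keep skipping
            } else {
                word.append((char) c);
            }
        }
        // End of input: return the last word if there is one
        return word.length() > 0 ? word.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is the path to the corpus; otherwise use a sample.
        // Files.newBufferedReader already returns a buffered reader.
        try (Reader in = args.length > 0
                ? Files.newBufferedReader(Paths.get(args[0]))
                : new StringReader("one long line of words")) {
            String word;
            while ((word = nextWord(in)) != null) {
                System.out.println(word);
            }
        }
    }
}
```

Only the current word is held in memory at any point, so the length of the line does not matter.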
Upvotes: 2