Rainflow

Reputation: 161

Java - How to read a big file word by word instead of line by line?

I'd like to read the "text8" corpus in Java and reformat some of the words. The problem is that in this 100MB corpus all the words are on a single line. If I try to load it with BufferedReader and readLine, the whole file is read into memory at once, which takes too much space, and I can't split all the words into one list/array.

So my question: is it possible in Java to read a corpus word by word instead of line by line? For example, since all the words are on one line, could I read 100 words per iteration?

Upvotes: 6

Views: 15774

Answers (4)

Rajan Kesharwani

Reputation: 1

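    try (FileInputStream fis = new FileInputStream("Example.docx")) {
        // Lower Apache POI's zip-bomb threshold so highly compressed documents still open
        ZipSecureFile.setMinInflateRatio(0.009);
        XWPFDocument file = new XWPFDocument(OPCPackage.open(fis));
        // Declare the extractor here; the original snippet used it without a declaration
        XWPFWordExtractor ext = new XWPFWordExtractor(file);
        Scanner scanner = new Scanner(ext.getText());
        while (scanner.hasNextLine()) {
            // Split each extracted line into words on runs of whitespace
            String[] value = scanner.nextLine().split("\\s+");
            for (String v : value) {
                System.out.println(v);
            }
        }
    } catch (Exception e) {
        System.out.println(e);
    }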

Upvotes: -1

rhitz

Reputation: 1892

Use the next() method of java.util.Scanner. From its documentation:

The next method finds and returns the next complete token from this scanner. A complete token is preceded and followed by input that matches the delimiter pattern. This method may block while waiting for input to scan, even if a previous invocation of Scanner.hasNext returned true.

Example:

public static void main(String[] args) {
    Scanner sc = new Scanner(System.in);
    String a = sc.next(); // first whitespace-delimited token
    String b = sc.next(); // second token
    System.out.println("First Word: " + a);
    System.out.println("Second Word: " + b);
    sc.close();
}

Input :

Hello Stackoverflow

Output :

First Word: Hello

Second Word: Stackoverflow

In your case, use a Scanner to read the file and call its next() method to read each token (word).
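As a rough sketch of how that could look for the "100 words per iteration" part of the question: the class name WordBatchReader, the method readInBatches, and the file name "text8" below are made up for illustration, and the file is assumed to be UTF-8 text. Scanner reads from the file through an internal buffer, so only the current batch of words needs to be held in memory.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.function.Consumer;

// Hypothetical helper, not part of any library.
class WordBatchReader {

    // Reads whitespace-separated words from the file and passes them to
    // the handler in lists of at most batchSize words.
    // Returns the total number of words read.
    static long readInBatches(Path file, int batchSize,
                              Consumer<List<String>> handler) throws IOException {
        long total = 0;
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            List<String> batch = new ArrayList<>(batchSize);
            while (scanner.hasNext()) {
                batch.add(scanner.next());
                if (batch.size() == batchSize) {
                    handler.accept(batch);
                    total += batch.size();
                    batch = new ArrayList<>(batchSize); // start a fresh batch
                }
            }
            if (!batch.isEmpty()) { // last, possibly shorter, batch
                handler.accept(batch);
                total += batch.size();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // "text8" is an assumed default file name; pass a path to override it.
        Path file = Paths.get(args.length > 0 ? args[0] : "text8");
        long total = readInBatches(file, 100,
                batch -> System.out.println(batch.size() + " words, first: " + batch.get(0)));
        System.out.println("Total words: " + total);
    }
}
```

The handler receives each batch as its own list, so it can reformat the words and write them out before the next batch is read.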

Upvotes: 1

nafas

Reputation: 5423

You can try using Scanner and set the delimiter to whatever suits you:

Scanner input=new Scanner(myFile);
input.useDelimiter(" +"); // delimiter is one or more spaces

while(input.hasNext()){
  System.out.println(input.next());
}

Upvotes: 6

MiKE

Reputation: 524

I would suggest using a character stream with FileReader.

Here is the example code from http://www.tutorialspoint.com/java/java_files_io.htm

import java.io.*;

public class CopyFile {
   public static void main(String args[]) throws IOException
   {
      FileReader in = null;
      FileWriter out = null;

      try {
         in = new FileReader("input.txt");
         out = new FileWriter("output.txt");

         int c;
         while ((c = in.read()) != -1) {
            out.write(c);
         }
      }finally {
         if (in != null) {
            in.close();
         }
         if (out != null) {
            out.close();
         }
      }
   }
}

FileReader decodes the file into 16-bit Unicode characters and read() returns them one at a time, so it doesn't matter that your text is all on one line.

Since you want to process the text word by word, you can simply read characters until you hit a space, and there's your word.
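A minimal sketch of that idea, assuming words are separated by any whitespace. The class WordReader and its method nextWord are hypothetical names, not an existing API, and the demo reads from a StringReader so that it runs on its own. For the real corpus the Reader could come from Files.newBufferedReader(Paths.get("text8")), which also adds buffering so each read() call does not hit the disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of any library.
class WordReader {

    // Returns the next whitespace-delimited word, or null at end of input.
    static String nextWord(Reader in) throws IOException {
        StringBuilder word = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (word.length() > 0) {
                    return word.toString(); // a separator ends the current word
                }
                // otherwise skip leading or repeated whitespace
            } else {
                word.append((char) c);
            }
        }
        // End of input: return the last word if there is one.
        return word.length() > 0 ? word.toString() : null;
    }

    public static void main(String[] args) throws IOException {
        // Sample text stands in for the corpus file.
        try (Reader in = new BufferedReader(new StringReader("all words on one line"))) {
            String w;
            while ((w = nextWord(in)) != null) {
                System.out.println(w);
            }
        }
    }
}
```

Only the current word is held in memory, so the size of the line does not matter.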

Upvotes: 2
