Suri

Reputation: 209

Improve the speed of reading and writing big files with BufferedWriter/BufferedReader

I want to read text files and convert each word to a number, then, for each file, write the sequence of numbers instead of words into a new file. I use a HashMap to assign exactly one number (identifier) to each word; for instance, the word apple is assigned the number 10, so whenever I see apple in a text file I write 10 into the sequence. I need a single HashMap to prevent assigning more than one identifier to a word. I wrote the following code, but it processes files slowly: converting one text file of size 165.7 MB to a sequence file took 20 hours, and I need to convert 600 text files of the same size. Is there any way to improve the efficiency of my code? The following function is called for each text file.

public void ConvertTextToSequence(File file) {
    try {
        FileWriter fileWriter = new FileWriter(path.keywordDocIdsSequence, true);
        BufferedWriter bufferedWriter = new BufferedWriter(fileWriter);

        String sequence = "";
        FileReader fileReader = new FileReader(file);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        String line = bufferedReader.readLine();
        while (line != null) {
            StringTokenizer tokens = new StringTokenizer(line);

            String str;
            while (tokens.hasMoreTokens()) {
                str = tokens.nextToken();
                if (keywordsId.containsKey(str)) {
                    sequence = sequence + " " + keywordsId.get(str);
                } else {
                    keywordsId.put(str, id);
                    sequence = sequence + " " + id;
                    id++;
                }

                // Flush the accumulated sequence every time 10000 new
                // words have been added to the map.
                if (keywordsId.size() % 10000 == 0) {
                    bufferedWriter.append(sequence);
                    sequence = "";
                    start = id;
                }
            }
            line = bufferedReader.readLine();
        }

        // Write whatever has accumulated since the last flush.
        if (start < id) {
            bufferedWriter.append(sequence);
        }

        bufferedReader.close();
        fileReader.close();

        bufferedWriter.close();
        fileWriter.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

The constructor of that class is:

public ConvertTextToKeywordIds() {
    path = new LocalPath();
    repository = new RepositorySQL();
    keywordsId = new HashMap<String, Integer>();
    id = 1;
    start = 1;
}

Upvotes: 1

Views: 485

Answers (2)

Josh Kergan

Reputation: 335

I suspect that the speed of your program is tied to the rehashing of the hash map as the number of words grows; each rehash incurs a time penalty that grows with the size of the map. You could estimate the number of unique words you expect and use that to initialize the hash map.
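For example, a minimal sketch of what your constructor could do instead (the 1,000,000 figure is purely an assumed estimate; substitute your own):

// HashMap rehashes once its size exceeds capacity * loadFactor (0.75 by
// default), so sizing the map up front avoids repeated rehashing.
int expectedWords = 1_000_000; // assumed estimate of distinct words
keywordsId = new HashMap<String, Integer>((int) (expectedWords / 0.75f) + 1);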

As @JB Nizet mentioned, you may want to write directly to the buffered writer rather than waiting to accumulate a number of entries, since the buffered writer is already set up to write to disk only when it has accumulated enough data.
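A minimal sketch of that idea, assuming the map and id counter from your question are passed in (the method name writeSequence is just for illustration):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;

// Streams each id straight to the BufferedWriter; the writer batches the
// actual disk writes internally, so no manual String accumulation is needed.
static int writeSequence(BufferedReader reader, BufferedWriter writer,
                         Map<String, Integer> keywordsId, int nextId) throws IOException {
    String line;
    while ((line = reader.readLine()) != null) {
        StringTokenizer tokens = new StringTokenizer(line);
        while (tokens.hasMoreTokens()) {
            String word = tokens.nextToken();
            Integer wordId = keywordsId.get(word); // one lookup instead of containsKey + get
            if (wordId == null) {
                wordId = nextId++;
                keywordsId.put(word, wordId);
            }
            writer.append(' ').append(wordId.toString());
        }
    }
    return nextId; // caller stores the updated counter back into the id field
}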

Upvotes: 2

OldCurmudgeon

Reputation: 65793

Your most effective performance boost is probably using a StringBuilder instead of String for your sequence.

I would also write and flush the sequence each time it exceeds a certain length rather than whenever you've added 10000 words to your map.
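A minimal sketch of both suggestions combined (the 8192-character threshold is an arbitrary assumption to tune):

import java.io.BufferedWriter;
import java.io.IOException;

// Appends one id to a reusable StringBuilder, flushing the builder to the
// writer once it grows past a chosen length.
private static final int FLUSH_THRESHOLD = 8192; // assumed value; tune as needed

static void appendId(StringBuilder sequence, int wordId, BufferedWriter writer)
        throws IOException {
    sequence.append(' ').append(wordId); // no intermediate String objects
    if (sequence.length() >= FLUSH_THRESHOLD) {
        writer.append(sequence); // Writer.append accepts any CharSequence
        sequence.setLength(0);   // clear the builder without reallocating
    }
}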

This map could get pretty huge; have you considered doing something about that? If you hit millions of entries you may get better performance from a database.

Upvotes: 1
