Imen Majdoubi

Reputation: 21

Optimize Java code that matches two enormous text files

I have a file containing the index of each document and its publication date:

0, 2012-05-26T00:00:00Z

1, 2012-05-26T00:00:00Z

5, 2010-06-26T00:00:00Z

10, 2014-05-26T00:00:00Z

and a second text file containing each term, its frequency, and the index of the document it belongs to:

was, 15, 1

kill, 10, 1

tunisia, 5, 5

peace, 1, 0

I have this method that matches both files so I can produce a third file of this form:

was, 15, 2012-05-26T00:00:00Z

kill, 10, 2012-05-26T00:00:00Z

tunisia, 5, 2010-06-26T00:00:00Z

peace, 1, 2012-05-26T00:00:00Z

I tested the method on a small test file and it works fine, but my real file is 1 TB, so the program has been running for 4 days and is still going. Could you please help me optimize it, or suggest another method?

public void matchingDateTerme(String pathToDateFich, String pathTotermeFich) {
    try {
        BufferedReader inTerme = new BufferedReader(new FileReader(pathTotermeFich));
        BufferedReader inDate = new BufferedReader(new FileReader(pathToDateFich));
        String lineTerme, lineDate;
        String idFich, idFichDate, dateterm, key;
        Hashtable<String, String> table = new Hashtable<String, String>();
        String[] tokens, dates;
        Enumeration ID = null;
        File tempFile = new File(pathTotermeFich.replace("fichierTermes", "fichierTermes_final"));
        FileWriter fileWriter = new FileWriter(tempFile);
        BufferedWriter writer = new BufferedWriter(fileWriter);

        // read the date file into the table
        while ((lineDate = inDate.readLine()) != null) {
            dates = lineDate.split(", ");
            idFichDate = dates[0].toLowerCase();
            dateterm = dates[1];
            table.put(idFichDate, dateterm);
        }

        // for each term line, scan the whole table for a matching id
        while ((lineTerme = inTerme.readLine()) != null) {
            tokens = lineTerme.split(", ");
            idFich = tokens[2].toLowerCase();
            String terme = tokens[0];
            String freq = tokens[1];
            ID = table.keys();
            while (ID.hasMoreElements()) {
                key = (String) ID.nextElement();
                if (key.equalsIgnoreCase(idFich)) {
                    String line = terme + ", " + freq + ", " + table.get(key);
                    System.out.println("Line: " + line);
                    writer.write(line);
                    writer.newLine();
                }
            }
        }

        writer.close();
        inTerme.close();
        inDate.close();

    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Upvotes: 1

Views: 98

Answers (3)

Mahesh

Reputation: 5308

There are a couple of considerations.

  1. Does this absolutely have to be done in Java? If yes, can you sort the files before you start reading them?
  2. Do you have to run through the files in a single pass? (I highly doubt it.) If not, you should split the sorted files into parts and only run through a subset of entries in each part file.
  3. Are both files in excess of 1 TB? If not, you should start with the smaller file.

Given file1:

0,2012-05-26T00:00:00Z
1,2012-05-26T00:00:00Z
5,2010-06-26T00:00:00Z
10,2014-05-26T00:00:00Z

and file2:

was,15,1
kill,10,1
tunisia,5,5
peace,1,0

here is an awk-based solution, based on the updated inputs:

awk -F',' 'FNR==NR{a[$1]=$2;next}{if(a[$3]==""){a[$3]=0}; print $1,",",$2,",",a[$3]} ' file1 file2

Output:

was , 15 , 2012-05-26T00:00:00Z
kill , 10 , 2012-05-26T00:00:00Z
tunisia , 5 , 2010-06-26T00:00:00Z
peace , 1 , 2012-05-26T00:00:00Z

This answer was helpful in deriving the above solution.

Upvotes: 0

avianey

Reputation: 5843

You should use a divide-and-conquer approach (https://en.wikipedia.org/wiki/Divide_and_conquer_algorithms) with the following pseudo-algorithm:

If A and B are your two large files
Open files A(0..n-1) for writing
Open file A for reading
  for line in file A
    let modulo = key % n
    write line to file A(modulo)
Open files B(0..n-1) for writing
Open file B for reading
  for line in file B
    let modulo = key % n
    write line to file B(modulo)
for i = 0..n-1
  Open file R(i) for writing
  Open files A(i) and B(i)
    merge those files into R(i) using key matching as you do now
Open file R for writing
for i = 0..n-1
  append R(i) to R

Try using n = 1024; if your keys are uniform, you will end up matching files of about 1 GB each.

You need free space on your disk (three times the size of A+B if you do not clean up the intermediate files).
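For illustration, the partitioning step above might look like this in Java. This is only a sketch: the class name, method name, and the assumption that the key is a numeric field at a known comma-separated column are mine, not from the answer.

```java
import java.io.*;

public class Partitioner {

    // Hash-partition a file into n bucket files by the numeric key found in
    // the given column, so that matching keys from both inputs always land
    // in bucket files with the same suffix and can be merged pairwise.
    static void partition(String inputPath, String outputPrefix,
                          int keyColumn, int n) throws IOException {
        BufferedWriter[] buckets = new BufferedWriter[n];
        for (int i = 0; i < n; i++) {
            buckets[i] = new BufferedWriter(
                new FileWriter(outputPrefix + "_" + i + ".txt"));
        }
        try (BufferedReader in = new BufferedReader(new FileReader(inputPath))) {
            String line;
            while ((line = in.readLine()) != null) {
                long key = Long.parseLong(line.split(", ")[keyColumn].trim());
                int b = (int) (key % n);   // bucket index = key modulo n
                buckets[b].write(line);
                buckets[b].newLine();
            }
        } finally {
            for (BufferedWriter w : buckets) w.close();
        }
    }
}
```

Run it once per input file (with the key column of that file), then merge bucket pair i of file A against bucket pair i of file B; each merge only has to hold one bucket's worth of data in memory.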

Upvotes: 0

ToYonos

Reputation: 16833

You are not using the Hashtable for what it is: an object that maps keys to values.

Iterating over the keys is useless and expensive; just use the get method:

String dateTerm = table.get(idFich);
if (dateTerm != null) {
    String line = terme + ", " + freq + ", " + dateTerm;
    System.out.println("Line: " + line);
    writer.write(line);
    writer.newLine();
}

As VGR said in a comment, using a HashMap, which is not synchronized, will be faster. More information here
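Putting both suggestions together, the whole method could be rewritten along these lines. This is a sketch, not the asker's exact code: it uses try-with-resources, drops the per-line println (printing every line to the console would dominate the runtime on a large file), and replaces the key scan with a single HashMap lookup.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

public class Matcher {

    // Same input/output convention as the question: the date file maps a
    // document id to a date, the term file holds "term, freq, docId" lines,
    // and the result file name is derived from the term file name.
    public static void matchingDateTerme(String pathToDateFich,
                                         String pathTotermeFich) throws IOException {
        Map<String, String> table = new HashMap<>();

        // Pass 1: load docId -> date into the map.
        try (BufferedReader inDate = new BufferedReader(new FileReader(pathToDateFich))) {
            String lineDate;
            while ((lineDate = inDate.readLine()) != null) {
                String[] dates = lineDate.split(", ");
                table.put(dates[0].toLowerCase(), dates[1]);
            }
        }

        File outFile = new File(pathTotermeFich.replace("fichierTermes", "fichierTermes_final"));

        // Pass 2: one O(1) lookup per term line instead of a scan of all keys.
        try (BufferedReader inTerme = new BufferedReader(new FileReader(pathTotermeFich));
             BufferedWriter writer = new BufferedWriter(new FileWriter(outFile))) {
            String lineTerme;
            while ((lineTerme = inTerme.readLine()) != null) {
                String[] tokens = lineTerme.split(", ");
                String date = table.get(tokens[2].toLowerCase());
                if (date != null) {
                    writer.write(tokens[0] + ", " + tokens[1] + ", " + date);
                    writer.newLine();
                }
            }
        }
    }
}
```

Note this still assumes the date table fits in memory; if it does not, combine it with the partitioning approach from the other answer.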

Upvotes: 1
