manil

Reputation: 159

Splitting a .gz file into specified file sizes in Java

This is my first post, so I'm not sure how apt my description of the issue is.

Below is a program I have written to split a .gz file into files of a size the user specifies. The parent .gz file is being split, but not into files of the size specified in the code. For example, in main I have said I want the parent file to be split into files of 1 MB each. But when I execute the code, it gets split into n files of different sizes. Can someone help me pinpoint where I am going wrong? Any help would be great, as I have run out of ideas.

package com.bitsighttech.collection.packaging;  


import java.io.BufferedReader;  
import java.io.DataInputStream;  
import java.io.File;  
import java.io.FileInputStream;  
import java.io.FileOutputStream;  
import java.io.InputStreamReader;  
import java.util.ArrayList;  
import java.util.List;  
import java.util.regex.Matcher;  
import java.util.regex.Pattern;  
import java.util.zip.GZIPInputStream;  
import java.util.zip.GZIPOutputStream;  

import org.apache.log4j.Logger;  

public class FileSplitter   
{  
    private static Logger logger = Logger.getLogger(FileSplitter.class);  
    private static final long KB = 1024;  
    private static final long MB = KB * KB;        

    public List<File> split(File inputFile, String splitSize)    
    {    
        int expectedNoOfFiles =0;         
        List<File> splitFileList = new ArrayList<File>();  
        try    
        {    
            double parentFileSizeInB = inputFile.length();  
            Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");  
            Matcher m = p.matcher(splitSize);  
            m.matches();  
            String FileSizeString = m.group(1);  
            System.out.println("FileSizeString----------------------"+FileSizeString);  
            String unit = m.group(2);  
            double fileSizeInMB = 0;  

            try {  
                if (unit.toLowerCase().equals("kb"))  
                    fileSizeInMB = Double.parseDouble(FileSizeString) / KB;           
                else if (unit.toLowerCase().equals("mb"))  
                    fileSizeInMB = Double.parseDouble(FileSizeString);                
                else if (unit.toLowerCase().equals("gb"))  
                    fileSizeInMB = Double.parseDouble(FileSizeString) * KB;           
            }   
            catch (NumberFormatException e) {  
                logger.error("invalid number [" + fileSizeInMB  + "] for expected file size");  
            }             
            System.out.println("fileSizeInMB----------------------"+fileSizeInMB);  
            double fileSize = fileSizeInMB * MB;  
            long fileSizeInByte = (long) Math.ceil(fileSize);  
            double noOFFiles = parentFileSizeInB/fileSizeInByte;   
            expectedNoOfFiles =  (int) Math.ceil(noOFFiles);  
            System.out.println("0000000000000000000000000"+expectedNoOfFiles);  
            GZIPInputStream in = new GZIPInputStream(new FileInputStream(inputFile));             
            DataInputStream datain = new DataInputStream(in);  
            BufferedReader fis = new BufferedReader(new InputStreamReader(datain));  
            int count= 0 ;  
            int splinterCount = 1;  
            GZIPOutputStream outputFileWriter = null;  
            while ((count = fis.read()) != -1)   
            {  
                System.out.println("count----------------------1 "+count);  
                int outputFileLength = 0;    
                outputFileWriter = new  GZIPOutputStream(new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles + ".gz"));  
                while (     (count = fis.read()) != -1   
                        &&  outputFileLength < fileSizeInByte  
                ) {    

                    outputFileWriter.write(count);    
                    outputFileLength ++;    
                    count = fis.read();  

                }  
                System.out.println("count----------------------2 "+count);  
                //outputFileWriter.finish();  
                outputFileWriter.close();  
                splinterCount ++;    
            }  
            fis.close();  
            datain.close();  
            in.close();  
            outputFileWriter.close();  
            System.out.println("Finished");  

        }catch(Exception e)    
        {    
            logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);  
            return null;  
        }    
        logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");  
        return splitFileList;  
    }      

    public static void main(String args[])   
    {  
        String filePath1 = "F:\\filename.gz";  
        File  file = new File(filePath1);  

        FileSplitter fileSplitter = new FileSplitter();  
        String splitlen = "1 MB";  
        int noOfFilesSplit = 3;  

        fileSplitter.split(file, splitlen);  

    }  
}  

Upvotes: 2

Views: 1843

Answers (3)

Dmitri

Reputation: 9157

Andreas' answer covers your main question, but there are a lot of problems in that code. Most importantly, you're throwing out one byte for each 'split' (the outer while calls fis.read() and ignores the value).

Why are you wrapping your gzip input stream in a DataInputStream and a BufferedReader if you're still reading it a byte at a time?

Edit

Ah, and you're also throwing out the last byte of each split (except for the very last one).
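Looking at the inner loop again, the trailing count = fis.read() inside the body also discards a value on every iteration, and going through a BufferedReader means the data is decoded into characters rather than copied as bytes. Below is a minimal sketch of a loop that avoids all of this, assuming each part should hold at most a fixed number of uncompressed bytes. The class and method names (GzipPartSplitter, split) are hypothetical, and the parts are kept in memory only to keep the sketch short; in the question's setting each part would presumably go to a FileOutputStream instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not the asker's class): re-packs gzip data into parts that
// each hold at most maxUncompressedBytes of the decompressed data.
class GzipPartSplitter {

    static List<byte[]> split(InputStream gzipSource, int maxUncompressedBytes) throws IOException {
        if (maxUncompressedBytes <= 0) {
            throw new IllegalArgumentException("maxUncompressedBytes must be positive");
        }
        List<byte[]> parts = new ArrayList<>();
        byte[] buffer = new byte[8192];

        // Read raw bytes straight from the GZIPInputStream: no Reader, no DataInputStream.
        try (GZIPInputStream in = new GZIPInputStream(gzipSource)) {
            ByteArrayOutputStream partBytes = null;
            GZIPOutputStream out = null;
            int inPart = 0; // uncompressed bytes written to the current part
            int n;

            // Ask only for as many bytes as still fit in the current part,
            // so nothing has to be read and then thrown away.
            while ((n = in.read(buffer, 0, Math.min(buffer.length, maxUncompressedBytes - inPart))) != -1) {
                if (out == null) {
                    // Open a part lazily, so no empty part is created at end of input.
                    partBytes = new ByteArrayOutputStream();
                    out = new GZIPOutputStream(partBytes);
                }
                out.write(buffer, 0, n); // every byte that was read is written
                inPart += n;

                if (inPart == maxUncompressedBytes) {
                    out.close(); // writes the gzip trailer for this part
                    parts.add(partBytes.toByteArray());
                    out = null;
                    inPart = 0;
                }
            }
            if (out != null) { // last, possibly shorter, part
                out.close();
                parts.add(partBytes.toByteArray());
            }
        }
        return parts;
    }
}
```

Each part produced this way is a complete gzip stream that can be decompressed on its own, but the compressed size of the parts will still vary, because the limit counts uncompressed bytes.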

Upvotes: 1

Joni

Reputation: 111269

When you compress data with gzip the output file size depends on the complexity of data. Here you are compressing equally sized blocks, but their compressed sizes are different. No lossless compression algorithm reduces the size of input by a constant factor.
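As a rough illustration (a sketch only, with a hypothetical class name GzipSizeDemo), two blocks with the same uncompressed size can gzip to very different sizes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class: compares the gzipped size of two equally sized blocks.
class GzipSizeDemo {

    // Returns the number of bytes gzip produces for the given data.
    static int gzippedSize(byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(data);
        }
        return sink.size();
    }

    public static void main(String[] args) throws IOException {
        int blockSize = 1024 * 1024; // 1 MB of uncompressed data per block

        byte[] zeros = new byte[blockSize]; // highly repetitive data
        byte[] noise = new byte[blockSize]; // pseudo-random data, hard to compress
        new Random(42).nextBytes(noise);

        System.out.println("zeros: " + gzippedSize(zeros) + " bytes");
        System.out.println("noise: " + gzippedSize(noise) + " bytes");
    }
}
```

Running main should show the all-zero block shrinking to a small fraction of its size, while the pseudo-random block stays at roughly 1 MB.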

If you want splinters of equal size you should split the compressed data instead of decompressing first. But that of course means that the splinters have to be decompressed in order and you can't decompress one without reading the ones that precede it.
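A minimal sketch of that approach, assuming the parts are written next to each other in an output directory. The names RawFileSplitter, split, openJoined and the ".partN" suffix are made up for illustration. The compressed file is cut into fixed-size pieces without being decompressed, and reading it back means feeding the pieces, in order, through a single GZIPInputStream:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.SequenceInputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Hypothetical helper: splits the compressed bytes themselves into equal-size parts.
class RawFileSplitter {

    // Copies the input file into parts of exactly partSize bytes (the last may be shorter).
    static List<Path> split(Path input, Path outputDir, long partSize) throws IOException {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(input)) {
            OutputStream out = null;
            long inPart = 0; // bytes written to the current part
            try {
                int n;
                // Never request more than still fits in the current part.
                while ((n = in.read(buffer, 0, (int) Math.min(buffer.length, partSize - inPart))) != -1) {
                    if (out == null) {
                        Path part = outputDir.resolve(input.getFileName() + ".part" + (parts.size() + 1));
                        parts.add(part);
                        out = Files.newOutputStream(part);
                    }
                    out.write(buffer, 0, n);
                    inPart += n;
                    if (inPart == partSize) {
                        out.close();
                        out = null;
                        inPart = 0;
                    }
                }
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
        return parts;
    }

    // Opens the parts, in list order, as one decompressed stream.
    // A single part is not a valid gzip file by itself (except possibly the first).
    static InputStream openJoined(List<Path> parts) throws IOException {
        List<InputStream> streams = new ArrayList<>();
        for (Path part : parts) {
            streams.add(Files.newInputStream(part));
        }
        return new GZIPInputStream(new SequenceInputStream(Collections.enumeration(streams)));
    }
}
```

With this layout, a call such as RawFileSplitter.split(file, dir, 1024 * 1024) would give parts of exactly 1 MB on disk, at the cost that they are only useful together.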

Upvotes: 0

Andreas Dolk

Reputation: 114787

Hard to tell, but it looks to me like you're counting the uncompressed bytes. The compressed chunks (the resulting files) will be smaller.

Upvotes: 0
