lucavenanzetti

Reputation: 95

How to split csv file into multiple files by size

In a Java project I generate a big CSV file (about 500 MB), and I need to split it into multiple files of at most 10 MB each. I found many similar posts, but none of them answers my question: in all of them the Java code splits the original file into chunks of exactly 10 MB and therefore truncates records. I need every record to stay complete and intact; no record should be truncated. If copying a record from the original CSV file into the current output file would push that file over 10 MB, I should be able to skip the copy, close that file, create a new one, and write the record there. Is this possible? Can someone help me? Thank you!

I tried this code:

File f = new File("/home/luca/Desktop/test/images.csv");
BufferedInputStream bis = new BufferedInputStream(new FileInputStream(f));
FileOutputStream out;
String name = f.getName();
int partCounter = 1;
int sizeOfFiles = 10 * 1024 * 1024; // 10 MB
byte[] buffer = new byte[sizeOfFiles];
int tmp = 0;
while ((tmp = bis.read(buffer)) > 0) {
 File newFile=new File("/home/luca/Desktop/test/"+name+"."+String.format("%03d", partCounter++));
 newFile.createNewFile();
 out = new FileOutputStream(newFile);
 out.write(buffer,0,tmp);
 out.close();
}

But obviously it doesn't work. This code splits the source file into n files of 10 MB each, truncating records. My CSV file has 16 columns, so with the procedure above the last record of a part may have, for example, only 5 columns populated; the rest are cut off.

SOLUTION: Here is the code I wrote.

FileReader fileReader = new FileReader("/home/luca/Desktop/test/images.csv");
BufferedReader bufferedReader = new BufferedReader(fileReader);
String line="";
int fileSize = 0;
BufferedWriter fos = new BufferedWriter(new FileWriter("/home/luca/Desktop/test/images_"+new Date().getTime()+".csv",true));
while((line = bufferedReader.readLine()) != null) {
    // Count the record plus its trailing newline, in the default charset that FileWriter also uses.
    int lineSize = (line + "\n").getBytes().length;
    if(fileSize + lineSize > 9.5 * 1024 * 1024){
        fos.flush();
        fos.close();
        fos = new BufferedWriter(new FileWriter("/home/luca/Desktop/test/images_"+new Date().getTime()+".csv",true));
        fos.write(line+"\n");
        fileSize = lineSize;
    }else{
        fos.write(line+"\n");
        fileSize += lineSize;
    }
}          
fos.flush();
fos.close();
bufferedReader.close();

This code reads a CSV file and splits it into n files; each file is at most 10 MB and each CSV line is either copied completely or not copied at all.

Upvotes: 8

Views: 10597

Answers (3)

Christopher Broderick

Reputation: 448

This will split any line-based file, including CSVs, into files that exceed the specified size by at most (line length - 1) bytes. It will repeat the header row if specified (such as for CSVs with a header row):

protected void processDocument(File inFile, long maxFileSize, boolean containsHeaderRow) throws Exception {
    if (maxFileSize > 0 && inFile.length() > maxFileSize) {
        FileReader fileReader = new FileReader(inFile);
        BufferedReader bufferedReader = new BufferedReader(fileReader);
        try {
            byte[] headerRow = new byte[0];
            if (containsHeaderRow) {
                try {
                    String headerLine = bufferedReader.readLine();
                    if (headerLine != null) {
                        headerRow = (headerLine + "\n").getBytes();
                    }
                } catch (IOException e1) {
                    throw new Exception("Failed to read header row from input file.", e1);
                }
            }
            long headerRowByteCount = headerRow.length;
            if (maxFileSize < headerRowByteCount) {
                // Would just write header repeatedly so throw error
                throw new Exception("Split file size is less than the header row size.");
            }
            int fileCount = 0;
            boolean notEof = true;
            while (notEof) {
                fileCount += 1;
                long fileSize = 0;
                // create a new file with same path but appended count
                String newFilename = inFile.getAbsolutePath() + "-" + fileCount;
                File outFile = new File(newFilename);
                BufferedOutputStream fos = null;
                try {
                    try {
                        fos = new BufferedOutputStream(new FileOutputStream(outFile));
                    } catch (IOException e) {
                        throw new Exception("Failed to initialise output file for file splitting on file " + fileCount, e);
                    }
                    if (containsHeaderRow) {
                        try {
                            fos.write(headerRow);
                        } catch (IOException e) {
                            throw new Exception("Failed to write header row to output file for file splitting on file " + fileCount, e);
                        }
                        fileSize += headerRowByteCount;
                    }
                    while (fileSize < maxFileSize) {
                        String line = null;
                        try {
                            line = bufferedReader.readLine();
                        } catch (IOException e) {
                            throw new Exception("Failed to read input file for file splitting on file " + fileCount, e);
                        }
                        if (line == null) {
                            notEof = false;
                            break;
                        }
                        byte[] lineBytes = (line + "\n").getBytes();
                        fos.write(lineBytes);
                        fileSize += lineBytes.length;
                    }
                    fos.flush();
                    fos.close();
                    processDocument(outFile); 
                } catch (IOException e) {
                    throw new Exception("Failed to write output file for file splitting on file number" + fileCount, e);
                } finally {
                    try {
                        if (fos != null) {
                            fos.close();
                        }
                    } catch (IOException e) {
                    }
                }
            }
        } finally {
            try {
                bufferedReader.close();
            } catch (IOException e) {
                throw new Exception("Failed to close reader for input file.", e);
            }
        }

    } else {
        processDocument(inFile); 
    }
}

Upvotes: 0

Al- Imran Khan

Reputation: 187

Use this: split -a 3 -b 100m -d filename.tar.gz newfilename

Upvotes: 0

Durandal

Reputation: 20059

In principle very simple.

Create a 10 MB buffer (byte[]) and read as many bytes as you can from the source. Then search backwards from the end of the buffer for a line feed. The portion from the beginning of the buffer up to and including that line feed becomes a new file. Retain the excess you read past the line feed and copy it to the start of the buffer (offset 0). Then repeat until the source is exhausted.
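
The buffer approach described above could look roughly like the following. This is only a minimal sketch, not tested on real data: the class name BufferSplitter, the method split, its parameters, and the ".001"-style part suffix are all made up for illustration. It assumes records are separated by '\n', that no quoted CSV field contains an embedded line feed, and that a line feed occurs within every maxBytes bytes. It does not repeat a header row.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named for illustration only.
class BufferSplitter {

    /** Splits source into parts of at most maxBytes each, cutting only after a '\n'. */
    static List<Path> split(Path source, Path outDir, int maxBytes) throws IOException {
        List<Path> parts = new ArrayList<>();
        byte[] buffer = new byte[maxBytes];
        int filled = 0; // number of valid bytes currently held in the buffer
        try (InputStream in = Files.newInputStream(source)) {
            while (true) {
                // Top up the buffer until it is full or the source is exhausted.
                boolean eof = false;
                while (filled < buffer.length) {
                    int n = in.read(buffer, filled, buffer.length - filled);
                    if (n < 0) {
                        eof = true;
                        break;
                    }
                    filled += n;
                }
                if (filled == 0) {
                    break; // nothing left to write
                }
                int cut; // number of bytes that go into the next part
                if (eof) {
                    cut = filled; // last part: everything that remains fits
                } else {
                    // Search backwards for the last line feed in the buffer.
                    int lf = filled - 1;
                    while (lf >= 0 && buffer[lf] != '\n') {
                        lf--;
                    }
                    if (lf < 0) {
                        throw new IOException("No line feed found within " + maxBytes + " bytes");
                    }
                    cut = lf + 1;
                }
                // Assumed naming scheme: original file name plus a three-digit counter.
                Path part = outDir.resolve(
                        String.format("%s.%03d", source.getFileName(), parts.size() + 1));
                try (OutputStream out = Files.newOutputStream(part)) {
                    out.write(buffer, 0, cut);
                }
                parts.add(part);
                // Move the excess (the start of the next record) to offset 0.
                System.arraycopy(buffer, cut, buffer, 0, filled - cut);
                filled -= cut;
                if (eof) {
                    break;
                }
            }
        }
        return parts;
    }
}
```

Because the search works on raw bytes, no decoding is needed, and in UTF-8 the byte 0x0A never appears inside a multi-byte character. A call such as BufferSplitter.split(Paths.get("images.csv"), Paths.get("out"), 10 * 1024 * 1024) would produce the 10 MB parts; the paths here are placeholders.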

Upvotes: 3
