maasha

Reputation: 1995

Ruby: How to split a file into multiple files of a given size

I want to split a text file into multiple files, each containing no more than 5 MB. I know there are tools for this, but I need it for a project and want to do it in Ruby. I would also prefer to do this with File.open in block form if possible, but my attempt fails miserably :o(

#!/usr/bin/env ruby

require 'pp'

MAX_BYTES = 5_000_000

file_num = 0
bytes    = 0

File.open("test.txt", 'r') do |data_in|
  File.open("#{file_num}.txt", 'w') do |data_out|
    data_in.each_line do |line|
      data_out.puts line

      bytes += line.length

      if bytes > MAX_BYTES
        bytes = 0
        file_num += 1
        # next file
      end
    end
  end
end

This works, but I don't think it is elegant. Also, I still wonder whether it can be done with File.open in block form only.

#!/usr/bin/env ruby

require 'pp'

MAX_BYTES = 5_000_000

file_num = 0
bytes    = 0

File.open("test.txt", 'r') do |data_in|
  data_out = File.open("#{file_num}.txt", 'w')

  data_in.each_line do |line|
    # Reopen only after the previous chunk was closed; a closed File still responds to :write
    data_out = File.open("#{file_num}.txt", 'w') if data_out.closed?
    data_out.puts line

    bytes += line.bytesize

    if bytes > MAX_BYTES
      bytes = 0
      file_num += 1
      data_out.close
    end
  end

  data_out.close unless data_out.closed?
end

Cheers,

Martin

Upvotes: 7

Views: 11403

Answers (4)

asaaki

Reputation: 2000

[Updated] Wrote a short version without any helper variables and put everything in a method:

def chunker f_in, out_pref, chunksize = 1_073_741_824
  File.open(f_in,"r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{"%05d"%(fh_in.pos/chunksize)}.txt","w") do |fh_out|
        fh_out << fh_in.read(chunksize)
      end
    end
  end
end

chunker "inputfile.txt", "output_prefix"   # optionally pass the desired chunk size as a third argument

Instead of looping over lines, you can use .read(length) and loop only on the EOF check and the file cursor.

This ensures that the chunk files are never bigger than your desired chunk size.

On the other hand, it pays no attention to line breaks (\n), so a line can be cut in the middle!

The number in each chunk file name comes from integer division of the current file cursor position by chunksize, formatted with "%05d", which yields 5-digit numbers with leading zeros (the first file is 00000).

This numbering is only possible because .read(chunksize) is used. The second example below cannot use it!
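To make the numbering concrete, here is a small sketch of the same calculation pulled out into a helper. The name chunk_name and the chunk size of 1_000 are made up for illustration; the format string and the integer division are the ones used in the method above.

```ruby
# Hypothetical helper: builds the output name the same way chunker does,
# from the cursor position at the moment the output file is opened.
def chunk_name(out_pref, pos, chunksize)
  "#{out_pref}_#{format('%05d', pos / chunksize)}.txt"
end

chunksize = 1_000
# Cursor positions before the first, second and third read of 1_000 bytes.
[0, 1_000, 2_000].each do |pos|
  puts chunk_name("output_prefix", pos, chunksize)
end
# prints:
# output_prefix_00000.txt
# output_prefix_00001.txt
# output_prefix_00002.txt
```

Because each read advances the cursor by exactly chunksize (except for the last one), the quotient increases by one per file.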

Update: Splitting with line break recognition

If you really need complete lines ending in \n, use this modified code snippet:

def chunker f_in, out_pref, chunksize = 1_073_741_824
  outfilenum = 1
  File.open(f_in,"r") do |fh_in|
    until fh_in.eof?
      File.open("#{out_pref}_#{outfilenum}.txt","w") do |fh_out|
        loop do
          line = fh_in.readline
          fh_out << line
          break if fh_out.size > (chunksize-line.length) || fh_in.eof?
        end
      end
      outfilenum += 1
    end
  end
end

I had to introduce a helper variable, line, because I want to keep the chunk file size below the chunksize limit. Without this extended check you will also get files above the limit, because a plain size check only takes effect in the next iteration, when the line has already been written. Note that the check uses the length of the line just written as an estimate for the next one, so a longer following line can still push a file past the limit. (Working with .ungetc or other complex calculations would make the code harder to read and no shorter than this example.)

Unfortunately, a second EOF check is needed, because the last chunk will usually end before the size limit is reached.

So two helper variables are needed: line, as described above, and outfilenum, because the resulting file sizes usually do not match chunksize exactly and the file number can no longer be derived from the cursor position.
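If the limit has to hold strictly, one option is to check the size before writing each line instead of after. The following is only a sketch under that assumption, not the method above: split_by_lines, max_bytes and the written counter are hypothetical names, and the output handle is kept in a variable rather than in a File.open block.

```ruby
# Sketch: split on line boundaries, checking the size *before* writing.
# A single line longer than max_bytes still gets a file of its own
# and will exceed the limit.
def split_by_lines(f_in, out_pref, max_bytes)
  file_num = 0
  written  = 0
  out      = nil
  File.open(f_in, "r") do |fh_in|
    fh_in.each_line do |line|
      # Start a new chunk if this line would push the current one over the limit.
      if out.nil? || (written > 0 && written + line.bytesize > max_bytes)
        out&.close
        file_num += 1
        out = File.open("#{out_pref}_#{file_num}.txt", "w")
        written = 0
      end
      out.write(line)
      written += line.bytesize
    end
  end
  file_num # number of chunk files written
ensure
  out&.close
end
```

Counting bytes with bytesize rather than length matters once the text contains multi-byte characters.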

Upvotes: 19

Mario Trento

Reputation: 523

This code actually works; it's simple, and it buffers lines in an array, which makes it faster:

#!/usr/bin/env ruby
MAX_BYTES = 3500
MAX_LINES = 32

filename = 'W:/IN/tangoZ.txt_100.TXT'
puts "File exists = #{File.exist?(filename)} #{filename}"

# Report the total line count and size before splitting
line_count = File.foreach(filename).count
file_size  = File.size(filename).to_f / 1_000_000
puts "Total lines = #{line_count}   size = #{file_size} MB"
puts

data     = []
line_num = 0
file_num = 0
bytes    = 0

File.foreach(filename) do |line|
  bytes    += line.bytesize
  line_num += 1
  data << line

  # Flush the buffered lines once the byte limit is exceeded.
  # To split by line count instead, test: line_num > MAX_LINES
  next unless bytes > MAX_BYTES

  file_num += 1
  File.open("#{file_num}.txt", 'w') { |f| f.write data.join }
  data.clear
  bytes    = 0
  line_num = 0
end

# Write leftovers, if any
unless data.empty?
  file_num += 1
  File.open("#{file_num}.txt", 'w') { |f| f.write data.join }
end

Upvotes: 0

Wayne Conrad

Reputation: 107999

For files of any size, split will be faster than scratch-built Ruby code, even taking the cost of starting a separate executable into account. It's also code that you don't have to write, debug or maintain:

system("split -C 1M -d test.txt ''")

The options are:

  • -C 1M Put lines totalling no more than 1M in each chunk
  • -d Use decimal suffixes in the output filenames
  • test.txt The name of the input file
  • '' Use a blank output file prefix

Unless you're on Windows, this is the way to go.
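If you call split from a larger script, it may be worth checking whether it succeeded. The sketch below passes the arguments as a list, which bypasses the shell, and raises on failure. The names split_command and split_with_tool are hypothetical; the options are the ones listed above, and the sketch assumes a split that supports them.

```ruby
# Hypothetical helper: the argument list for the split call shown above.
def split_command(input, prefix = "", chunk = "1M")
  ["split", "-C", chunk, "-d", input, prefix]
end

# Runs split and raises if it did not succeed.
def split_with_tool(input, prefix = "", chunk = "1M")
  # system returns true on success, false on a non-zero exit status,
  # and nil if the command could not be executed at all.
  ok = system(*split_command(input, prefix, chunk))
  raise "split failed for #{input}" unless ok
  true
end
```

Usage would be split_with_tool("test.txt"), which runs the same command as the one-liner above.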

Upvotes: 14

xinit

Reputation: 1916

Instead of opening your outfile inside the infile block, open the file and assign it to a variable. When you hit the file size limit, close the file and open a new one.
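A minimal sketch of that idea, using the question's file naming (0.txt, 1.txt, ...). The method name split_file and the max_bytes parameter are made up for illustration, and the output file is opened lazily so that no empty file is left behind at the end.

```ruby
# Sketch: keep the output handle in a variable and rotate it at the limit.
# A file can exceed max_bytes by up to one line, because the check
# happens after the line has been written.
def split_file(path, max_bytes)
  file_num = 0
  bytes    = 0
  data_out = nil
  File.open(path, "r") do |data_in|
    data_in.each_line do |line|
      # Open the next output file only when there is a line to write.
      data_out ||= File.open("#{file_num}.txt", "w")
      data_out.write(line)
      bytes += line.bytesize
      next unless bytes >= max_bytes

      # Limit reached: close this chunk; the next line opens a new file.
      data_out.close
      data_out = nil
      file_num += 1
      bytes = 0
    end
  end
ensure
  # Close the last chunk, also when an exception interrupts the loop.
  data_out&.close
end
```

For the question's case this would be called as split_file("test.txt", 5_000_000).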

Upvotes: 0
