Prajakta

Reputation: 27

Unix split command: splitting a file into multiple files also splits records

I have a requirement to split a file into multiple files before sending it over FTP (the FTP transfer has a 1 GB limit). I am using the split command to do so.

SPLIT_FILE_SIZE=900M

split --bytes="$SPLIT_FILE_SIZE" -d "$FILE" "${FILE}_"

Now I am noticing that it also splits records across files. The data within a record does not contain any newline characters.

For example, my original file contains:

a|b|c|d|e|f
a1|b1|c1|d1|e1|f1
a2|b2|c2|d2|e2|f2
a3|b3|c3|d3|e3|f3
a4|b4|c4|d4|e4|f4

The split files then look like this:

First file content:

a|b|c|d|e|f

a1|b1|c1|d1|e1|f1

a2|b2|c2|

Second file content:

d2|e2|f2

a3|b3|c3|d3|e3|f3

a4|b4|c4|d4|e4|f4

I would appreciate any suggestions.

Upvotes: 1

Views: 157

Answers (3)

Prajakta

Reputation: 27

Here is how I did it:

SPLIT_FILE_SIZE=900   # target chunk size in MB

# Average line length in bytes (+1 for the newline that awk strips from $0)
avg_length_of_line=$(awk '{ total += length($0) + 1; count++ } END { print total/count }' "$FILE")

# Round the average to a whole number
r_avg_length_of_line=$(printf "%.0f" "$avg_length_of_line")

max_limit_of_file=$((SPLIT_FILE_SIZE * 1024 * 1024))

max_line_count=$((max_limit_of_file / r_avg_length_of_line))

split -l "$max_line_count" -d "$FILE" "${FILE}_"

Upvotes: 0

Code Different

Reputation: 93151

Since you are asking split to count bytes, it doesn't care whether the split point falls in the middle of a line. Instead, compute the average number of bytes per line, add a safety margin, and split by line count:

split -l "$SPLIT_FILE_LINE" -d "$FILE" "${FILE}_"

You can count the number of lines in the file using wc -l $FILENAME. Note that the split on Mac OS X and FreeBSD may not support the -d option.
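The "average bytes per line plus a safety margin" idea could be sketched roughly as below. The function name lines_per_chunk, its arguments, and the 10% default margin are all hypothetical choices for illustration, not something from the question. Because it works from an average, a file whose line lengths vary a lot can still produce a chunk over the limit, so treat the margin as something to tune.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many lines fit in one chunk.
# Usage: lines_per_chunk FILE MAX_BYTES [MARGIN_PERCENT]
lines_per_chunk() {
    file=$1
    max_bytes=$2
    margin_pct=${3:-10}          # assumed default safety margin of 10%

    total_bytes=$(wc -c < "$file")
    total_lines=$(wc -l < "$file")
    if [ "$total_lines" -le 0 ]; then
        echo "no lines in $file" >&2
        return 1
    fi

    # Average bytes per line, rounded up (wc -c already counts the newlines).
    avg=$(( (total_bytes + total_lines - 1) / total_lines ))

    # Reduce the byte budget by the margin, then convert bytes to lines.
    budget=$(( max_bytes * (100 - margin_pct) / 100 ))
    n=$(( budget / avg ))
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}
```

It might then be used as SPLIT_FILE_LINE=$(lines_per_chunk "$FILE" $((900 * 1024 * 1024))) before the split -l call. If the split in use is the GNU coreutils one, its --line-bytes=SIZE (-C) option puts at most SIZE bytes of whole lines into each output file, which avoids the estimate altogether.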

Upvotes: 1

ghoti

Reputation: 46826

This can be extended as you need, but in its most basic form, as long as you're dealing with text input, you may be able to use something like this:

#!/usr/bin/awk -f

BEGIN {
  inc = 1
}

s > 900*1024*1024 {        # 900MB, per your question
  close("outfile." inc)    # finished with the current chunk
  inc++
  s = 0
}

{
  s += length($0) + 1      # +1 for the newline that awk strips from $0
  print > ("outfile." inc)
}

This walks through the file line by line, adding each line's length to a running total. Once the total passes the limit, it resets the total and increments the counter that is used in the output filename.

Upgrades might include, perhaps, taking the size from a command line option (ARGV[]), or including some sort of status/debugging output as the script runs.
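As one possible take on the "size from the command line" upgrade, here is a minimal sketch that wraps the awk logic in a shell function and passes the limit with awk's -v option rather than ARGV[]. The function name split_by_size and its three arguments are hypothetical. Unlike the script above, it starts a new file before a line would push the current one past the limit, and it runs awk with LC_ALL=C on the assumption that length() should count bytes rather than characters.

```shell
#!/bin/sh
# Hypothetical wrapper: split_by_size FILE MAX_BYTES PREFIX
# Writes PREFIX1, PREFIX2, ... keeping whole lines together.
split_by_size() {
    LC_ALL=C awk -v max="$2" -v prefix="$3" '
        BEGIN { inc = 1; s = 0 }
        {
            len = length($0) + 1        # +1 for the newline awk strips
            # Start a new file if this line would exceed the limit.
            # A single line longer than max still gets a file of its own.
            if (s > 0 && s + len > max) {
                close(prefix inc)
                inc++
                s = 0
            }
            s += len
            print > (prefix inc)
        }
    ' "$1"
}
```

For the case in the question this might be called as split_by_size "$FILE" $((900 * 1024 * 1024)) "${FILE}_".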

Upvotes: 1
