Arpit Singh

Reputation: 95

Split a file into several files based on a condition and an approximate number of lines

I have a large file; a sample is shown below:

A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

This sample file has order headers (00000) and their related order details (00100, 00200, etc.). I want to split the file into pieces of around 40000 lines each, such that each order header stays together with its order details in the same file.

I used GNU parallel to split the file into chunks of 40000 lines, but I am not able to make the split satisfy the condition that an Order Header and its related order details stay together, while still keeping each file at around 40000 lines.

For the sample file above, if I had to split it into files of around 5 lines each, I would use the following:

parallel --pipe -N5 'cat > sample_{#}.txt' <sample.txt

But that would give me

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555

sample_2.txt
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

This would put the 2nd Order Header in the first file, and its related order details in the second one.

The desired output would be:

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555

Upvotes: 3

Views: 840

Answers (3)

Walter A

Reputation: 20032

When each Order Header has a lot of records, you might consider the simple

csplit -z sample.txt '/00000,/' '{*}'

This will make one file per Order Header. It ignores the ~40K-lines requirement and might produce a very large number of files, so it is only a viable solution when you have a limited number (perhaps 40?) of different Order Headers.
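For the sample above, this should give one output file per header group; with GNU csplit's default prefix the files are named xx00, xx01, xx02 (the prefix can be changed with -f):

xx00
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

xx01
A222, 00000, 555
A222, 00100, 555

xx02
A222, 00000, 555
A222, 00200, 555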

When you do want different headers combined in a file, consider

awk -v max=40000 '
   # flush(): write the buffered lines of the previous group; start a new
   # output file when adding them would push the current file past max lines.
   function flush() {
      if (last+nr>max || sample==0) {
         outfile="sample_" sample++ ".txt";
         last=0;
      }
      for (i=0;i<nr;i++) print a[i] >> outfile;
      last+=nr;
      nr=0;
   }
   BEGIN { sample=0 }
   /00000,/ { flush(); }   # an Order Header line starts a new group
   {a[nr++]=$0}            # buffer the lines of the current group
   END { flush() }         # write the final group
   ' sample.txt
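With max=5 on the sample above (and assuming sample_0.txt and sample_1.txt don't already exist, since the script appends with >>), this should produce the desired grouping, with file numbering starting at 0:

sample_0.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555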

Upvotes: 0

Ole Tange

Reputation: 33740

Single record:

cat file | parallel --pipe --recstart 'A222, 00000, 555' -n1 'echo Single record;cat'

Multiple records (up to --block-size):

cat file | parallel --pipe --recstart 'A222, 00000, 555' --block-size 100 'echo Multiple records;cat'

If 'A222' does not stay the same:

cat file | parallel -k --pipe --regexp --recstart '[A-Z]\d+, 00000' -N1 'echo Single record;cat'
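To get close to the ~40000-line files the question asks for while keeping header groups intact, the record-based split can be combined with a block size and one output file per chunk. This is only a sketch: the --block value of 640k is an assumption (40000 lines at a guessed ~16 bytes per line) and would need tuning for the real data:

cat sample.txt | parallel -k --pipe --recstart 'A222, 00000, 555' --block 640k 'cat > sample_{#}.txt'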

Upvotes: 0

anubhava

Reputation: 785866

You may try this code:

( export hdr=$(head -1 sample.txt); parallel  --pipe -N4 '{ echo "$hdr"; cat; } > sample_{#}.txt' < <(tail -n +2 sample.txt) )

We basically keep the header row separate, split the remaining lines, and prepend the header to each split file.
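Tracing this on the sample above: the first line is kept as the header, the remaining seven lines are split four at a time, and each chunk gets the header prepended, so the result should be roughly:

sample_1.txt
A222, 00000, 555
A222, 00100, 555
A222, 00200, 555
A222, 00300, 555
A222, 00000, 555

sample_2.txt
A222, 00000, 555
A222, 00100, 555
A222, 00000, 555
A222, 00200, 555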

Upvotes: 4
