GB44444

Reputation: 57

Bash script to print X lines of a file in sequence

I'd be very grateful for your help with something probably quite simple.

I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.

2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974

I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.

What I have so far is this:

#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done

Desired output for file0.txt:

2655087
3721239
5728533
9082076

Desired output for file1.txt:

2016819
8983893
9446748
6607974

Something is going wrong with this, because I am getting identical outputs from all my files (i.e. they all look like the desired output of file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=1, I want the output to be the values of rows 5, 6, 7 and 8.

This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)

Thank you very much.

Upvotes: 3

Views: 976

Answers (4)

Jorge Bellon

Reputation: 3096

With just head and split you can do it very simply:

chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
  split -d -a 5 --additional-suffix=.txt -l $chunk - file

Basically, read the first 40,000 lines (4 × 10,000) and split them into chunks of 4 consecutive lines, using file as the prefix and .txt as the suffix for the new files.

If you want a numeric identifier, you will need 5 digits (-a 5), as pointed out in the comments (credit: @kvantour).
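As a quick sketch of what this produces, here is the same pipeline on a hypothetical 12-line demo.txt with 3 chunks of 4 lines (the names demo.txt and part are illustrative; -d, -a and --additional-suffix assume GNU coreutils split):

```shell
# Build a small sample file: lines 1..12.
seq 12 > demo.txt
chunk=4
files=3
# Take the first chunk*files lines and split them into 4-line files
# named part00000.txt, part00001.txt, part00002.txt.
head -n $((chunk * files)) demo.txt |
  split -d -a 5 --additional-suffix=.txt -l "$chunk" - part
ls part*.txt
```

Each output file then holds one consecutive 4-line chunk, e.g. part00001.txt contains lines 5 to 8.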

Upvotes: 3

kvantour

Reputation: 26471

The beauty of awk is that you can do this in one awk line:

awk '{ print > ("file" (c+0) ".txt") }   # c+0 forces 0 instead of "" on the first chunk
     (NR % 4 == 0) { ++c }
     (c == 10001) { exit }' <file>

This can be made slightly faster and more file-handle friendly (cf. James Brown's answer):

awk 'BEGIN{f="file0.txt" }
     { print > f }
     (NR % 4 == 0) { close(f); f="file"++c".txt" }
     (c == 10001) { exit }' <file>
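A tiny reproduction of the second one-liner (demo.txt is an assumed 8-line sample, so it produces just file0.txt and file1.txt rather than 10,001 files):

```shell
# Sample input: lines 101..108.
seq 101 108 > demo.txt
# Write 4 lines per file, closing each file before opening the next.
awk 'BEGIN{ f="file0.txt" }
     { print > f }
     (NR % 4 == 0) { close(f); f = "file" ++c ".txt" }' demo.txt
cat file1.txt
```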

Why did your script fail?

The reason your script is failing is that you used single quotes and tried to pass a shell variable inside them. The shell never expands $i there, so awk sees the literal string "$i", which evaluates to 0 in arithmetic context; every iteration therefore selects lines 1 to 4, which is why all your files are identical. Your lines should read:

awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt

but this is very ugly and should be improved with

awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt

Why is your script slow?

The way you are processing your file is a loop of 10001 iterations. Per iteration, you perform 4 awk calls. Each awk call reads the full file and writes out a single line, so in the end you read your file 40004 times.

To optimise your script step by step, I would do the following:

  1. Terminate awk to stop reading the file after the line is printed

    #!/bin/bash
    for i in {0..10000}; do
      awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
      awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
      awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
      awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
    done
    
  2. Merge the 4 awk calls into a single one. This prevents reading the first lines over and over per loop cycle.

    #!/bin/bash
    for i in {0..10000}; do
      awk -v i=$i '(NR<=4*i)    {next}            # skip line
                   (NR>4*(i+1)) {exit}            # exit awk
                   1' table2.txt > file"$i".txt   # print line
    done
    
  3. remove the final loop (see top of this answer)
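A small end-to-end check of step 2, run with two iterations on an assumed 8-line table2.txt instead of 10,001 iterations on the real file:

```shell
# Sample input: lines 1..8, i.e. two 4-line chunks.
seq 8 > table2.txt
for i in 0 1; do
  awk -v i="$i" '(NR <= 4*i)   { next }   # skip lines before this chunk
                 (NR > 4*(i+1)){ exit }   # stop once the chunk is done
                 1' table2.txt > "file$i.txt"
done
cat file1.txt
```

file0.txt gets lines 1 to 4 and file1.txt gets lines 5 to 8, as desired.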

Upvotes: 6

James Brown

Reputation: 37404

Another awk:

$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls 
file file0.txt  file1.txt

Explained:

awk ' {
    if(NR%4==1) {            # use mod to recognize first record of group
        if(i==10000)         # exit after 10000 files 
            exit             # test with 1
        close(f)             # close previous file
        f="file" i++ ".txt"  # make a new filename
    }
    print > f                # output record to file
}' file

Upvotes: 2

Ed Morton

Reputation: 203368

This is functionally the same as @JamesBrown's answer, just written more awk-ishly, so don't accept this one. I only posted it to show the more idiomatic awk syntax, since you can't put formatted code in a comment.

awk '
    (NR%4)==1 { close(out); out="file" c++ ".txt" }
    c > 10000 { exit }
    { print > out }
' file

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
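The same idiomatic structure on an assumed 8-line sample.txt (the c > 10000 guard never fires at this size and is kept only for fidelity):

```shell
# Sample input: lines 1..8.
seq 8 > sample.txt
awk '
    (NR%4)==1 { close(out); out = "file" c++ ".txt" }   # new file every 4 lines
    c > 10000 { exit }                                  # stop after 10001 files
    { print > out }
' sample.txt
cat file0.txt file1.txt
```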

Upvotes: 3
