Reputation: 57
I'd be very grateful for your help with something probably quite simple.
I have a table (table2.txt), which has a single column of randomly generated numbers, and is about a million lines long.
2655087
3721239
5728533
9082076
2016819
8983893
9446748
6607974
I want to create a loop that repeats 10,000 times, so that for iteration 1, I print lines 1 to 4 to a file (file0.txt), for iteration 2, I print lines 5 to 8 (file1.txt), and so on.
What I have so far is this:
#!/bin/bash
for i in {0..10000}
do
awk 'NR==((4 * "$i") +1)' table2.txt > file"$i".txt
awk 'NR==((4 * "$i") +2)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +3)' table2.txt >> file"$i".txt
awk 'NR==((4 * "$i") +4)' table2.txt >> file"$i".txt
done
Desired output for file0.txt:
2655087
3721239
5728533
9082076
Desired output for file1.txt:
2016819
8983893
9446748
6607974
Something is going wrong with this, because I am getting identical outputs from all my files (i.e. they all look like the desired output of file0.txt). Hopefully you can see from my script that during the second iteration, i.e. when i=2, I want the output to be the values of rows 5, 6, 7 and 8.
This is probably a very simple syntax error, and I would be grateful if you can tell me where I'm going wrong (or give me a less cumbersome solution!)
Thank you very much.
Upvotes: 3
Views: 976
Reputation: 3096
With just bash you can do it very simple:
chunk=4
files=10000
head -n $(($chunk*$files)) table2.txt |
split -d -a 5 --additional-suffix=.txt -l $chunk - file
Basically read first 10k lines and split them into chunks of 4 consecutive lines, using file
as prefix and .txt
as suffix for the new files.
If you want a numeric identifier, you will need 5 digits (-a 5
), as pointed in the comments (credit: @kvantour).
Upvotes: 3
Reputation: 26471
The beauty of awk
is that you can do this in one awk
line :
awk '{ print > ("file"c".txt") }
(NR % 4 == 0) { ++c }
(c == 10001) { exit }' <file>
This can be slightly more optimized and file handling friendly (cfr. James Brown):
awk 'BEGIN{f="file0.txt" }
{ print > f }
(NR % 4 == 0) { close(f); f="file"++c".txt" }
(c == 10001) { exit }' <file>
Why did your script fail?
The reason why your script is failing is because you used single quotes and tried to pass a shell variable to it. Your lines should read :
awk 'NR==((4 * '$i') +1)' table2.txt > file"$i".txt
but this is very ugly and should be improved with
awk -v i=$i 'NR==(4*i+1)' table2.txt > file"$i".txt
Why is your script slow?
The way you are processing your file is by doing a loop of 10001 iterations. Per iterations, you perform 4 awk
calls. Each awk
call reads the full file completely and writes out a single line. So in the end you read your files 40004 times.
To optimise your script step by step, I would do the following :
Terminate awk
to step reading the file after the line is print
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i 'NR==(4*i+1){print; exit}' table2.txt > file"$i".txt
awk -v i=$i 'NR==(4*i+2){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+3){print; exit}' table2.txt >> file"$i".txt
awk -v i=$i 'NR==(4*i+4){print; exit}' table2.txt >> file"$i".txt
done
Merge the 4 awk
calls into a single one. This prevents reading the first lines over and over per loop cycle.
#!/bin/bash
for i in {0..10000}; do
awk -v i=$i '(NR<=4*i) {next} # skip line
(NR> 4*(i+1)}{exit} # exit awk
1' table2.txt > file"$i".txt # print line
done
remove the final loop (see top of this answer)
Upvotes: 6
Reputation: 37404
Another awk:
$ awk '{if(NR%4==1){if(i==10000)exit;close(f);f="file" i++ ".txt"}print > f}' file
$ ls
file file0.txt file1.txt
Explained:
awk ' {
if(NR%4==1) { # use mod to recognize first record of group
if(i==10000) # exit after 10000 files
exit # test with 1
close(f) # close previous file
f="file" i++ ".txt" # make a new filename
}
print > f # output record to file
}' file
Upvotes: 2
Reputation: 203368
This is functionally the same as @JamesBrown's answer but just written more awk-ishly so don't accept this, I just posted it to show the more idiomatic awk syntax as you can't put formatted code in a comment.
awk '
(NR%4)==1 { close(out); out="file" c++ ".txt" }
c > 10000 { exit }
{ print > out }
' file
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the reasons why you should avoid shell loops for manipulating text.
Upvotes: 3