Reputation: 637
I have a data with following format:
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
B 36.459000 29.662701 17.806299 5
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
5
3
A 17.501801 44.279202 8.005670 5
B 35.853001 43.095901 17.402901 4
B 1.326100 17.127600 39.600300 4
A 9.837760 41.103199 13.062300 5
B 31.686800 44.997501 16.619499 4
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
A 42.586601 21.343800 23.280701 5
B 30.145800 13.256200 30.713800 4
B 29.186001 44.353699 9.057160 4
A 39.311199 27.371201 35.473999 5
B 31.437799 30.415001 37.454399 4
B 48.501999 1.277730 21.900600 4
...
The data consists of multiple repeating data blocks. Each data block has the same format as follows:
First line = Number of lines of data block
Second line = ID of data block
From third line = Actual data lines (with number of lines designated in the first line)
For example, if we see the first data block:
2 (=> This data block has 2 number of lines)
1 (=> This data block's ID is 1 (very first data block))
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
(Third and fourth lines of the data block are actual data.
The integer of the first line, the number of lines for this data block, is 2, so the actual data consist of 2 lines.
All other data blocks follow the same format. )
This format repeats 100,000 times. In other words, I have 100,000 data blocks just like this example in the single text file. I can provide the total number of the data blocks.
What I wish to do is sort "Actual data lines" of each data block by the fourth column and print out the data in the same format as the original data file. What I wish to achieve from the examples above will look like:
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4
...
I know the typical command to sort using the 4th column the data in Linux, which is:
sort -gk4 data.txt > data_4sort.txt
However, in this case, I need to perform sorting per each data block. Plus, each data block does not have a uniform number of lines, but the number of data lines for each data block varies.
I really have no idea how can I even approach this. I'm considering using a shell script with for loop and/or awk, sed, split, etc, with sort command. But I'm not even sure how can I use the for loop with this problem, or if the for loop and shell script really required to perform this block-wise sorting.
Upvotes: 4
Views: 187
Reputation: 34808
I know, old question but ... another idea that's a bit more performant with newer gawk
abilities ...
One idea using gawk
and its predefined array scanning orders (ie, eliminate the need for an external sort
) ...
awk '
BEGIN { delete arr } # hack to pre-declare variable arr as an array
function sort_block() {
if ( length(arr) > 0 ) { # if array is not empty then ...
PROCINFO["sorted_in"]="@ind_num_desc" # desginate sorting by index/descending and then ...
for (i in arr) # loop through array ...
print arr[i] # printing arry entries
}
delete arr # clear array
}
NF==1 { sort_block()
print
}
NF>1 { arr[$4]=$0
next
}
END { sort_block() } # flush anything still in array
' block.dat
Running this against OPs sample data set generates:
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4
Repeated runs of this code in a cygwin
environment within a VM on a laptop-grade i7 (ie, this is by no means a fast system) provide the following best timing:
real 0m0.026s
user 0m0.000s
sys 0m0.015s
Upvotes: 0
Reputation: 203807
Use awk to prepare the text for sorting and then pipe it to 1 call to sort
with no subshells being spawned and no temporary files created:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN { OFS="\t"; recNr=1 }
NF < pNF { ++recNr }
{
print recNr, NF, NR, $0
pNF = NF
}
' "${@:--}" |
sort -k1,1n -k2,2n -k7,7nr -k3,3n |
cut -f4-
$ ./tst.sh file
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4
here's the 3rd-run timing difference in execution speed for the OPs input file between the above:
$ time ./tst.sh file >/dev/null
real 0m0.105s
user 0m0.030s
sys 0m0.091s
and the current awk answers that spawn a subshell per block of input to be sorted:
$ cat tst2.sh
#!/usr/bin/env bash
awk '{
if(NF > 1){
print | "sort -k4,4nr"
}
else{
close("sort -k4,4nr")
print
}
}' "${@:--}"
$ time ./tst2.sh file >/dev/null
real 0m0.405s
user 0m0.120s
sys 0m0.121s
$ cat tst3.sh
#!/usr/bin/env bash
awk '
function sort() {
close(tmp_file)
system("sort -rnk4 "tmp_file"; rm -rf "tmp_file)
}
BEGIN { tmp_file = "/tmp/tmp_file" }
state == 0 && NF == 1 { print ; next }
state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
state == 1 && NF != 1 { print >>tmp_file ; next }
state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
END { if (state == 1) { sort() } }
' "${@:--}"
$ time ./tst3.sh file >/dev/null
real 0m0.804s
user 0m0.060s
sys 0m0.396s
Upvotes: 4
Reputation: 28695
One option to pipe to a command as below without explicitly creating a temporary file.
awk '
BEGIN {
cmd = "sort -k4,4nr"
}
{
if(NF > 1){
print | cmd
}
else{
close(cmd)
print
}
}' input_file
Note: Changed to numeric sorting and moved cmd to a variable thanks to helpful tips from @Ed Morton
Output:
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4
Upvotes: 2
Reputation: 881783
This can be done using a combination of awk
and sort
. You use awk
to drive the whole process as a very simple state machine, and it follows these rules:
The actions are basically, for each line.
At the end, sort/print/delete the sort file if needed (since there won't be a final transition from long line to short line after the final block, this action takes care of that).
That processing is implemented as per below, a shell script that calls awk
to do the work:
awk '
function sort() {
close(tmp_file)
system("sort -rnk4 "tmp_file"; rm -rf "tmp_file)
}
BEGIN { tmp_file = "/tmp/tmp_file" }
state == 0 && NF == 1 { print ; next }
state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
state == 1 && NF != 1 { print >>tmp_file ; next }
state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
END { if (state == 1) { sort() } }
' input_file
You can also do it without a temporary file with an advanced enough awk
variant, one that allows for pipes:
awk 'NF > 1 { print | "sort -rnk4"; next }
{ close("sort -rnk4"); print }
' input_file
The advantage of this is a much simpler and shorter awk
program that does most of the stuff in the previous script implicitly.
Running either of those scripts over your input file generates, as requested, each separate grouping sorted on the fourth column (the final float):
2
1
A 11.364500 48.395199 40.160500 5
B 35.067699 30.944700 24.155300 4
4
2
A 26.274099 38.533298 32.624298 4
A 15.850300 28.130899 24.753901 4
A 32.098000 33.665699 20.372499 4
B 36.459000 29.662701 17.806299 5
5
3
B 1.326100 17.127600 39.600300 4
B 35.853001 43.095901 17.402901 4
B 31.686800 44.997501 16.619499 4
A 9.837760 41.103199 13.062300 5
A 17.501801 44.279202 8.005670 5
3
4
B 31.274700 8.726580 25.267599 4
A 19.032400 41.384701 19.456301 5
B 19.441900 24.286400 6.961680 4
1
5
B 48.973999 15.508400 5.099710 4
6
6
B 31.437799 30.415001 37.454399 4
A 39.311199 27.371201 35.473999 5
B 30.145800 13.256200 30.713800 4
A 42.586601 21.343800 23.280701 5
B 48.501999 1.277730 21.900600 4
B 29.186001 44.353699 9.057160 4
Upvotes: 5