exsonic01

Reputation: 637

Successive sorting of repeating data blocks with different numbers of lines

I have data in the following format:

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
B    36.459000   29.662701   17.806299         5
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
5
3
A    17.501801   44.279202    8.005670         5
B    35.853001   43.095901   17.402901         4
B     1.326100   17.127600   39.600300         4
A     9.837760   41.103199   13.062300         5
B    31.686800   44.997501   16.619499         4
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
A    42.586601   21.343800   23.280701         5
B    30.145800   13.256200   30.713800         4
B    29.186001   44.353699    9.057160         4
A    39.311199   27.371201   35.473999         5
B    31.437799   30.415001   37.454399         4
B    48.501999    1.277730   21.900600         4
...

The data consists of multiple repeating data blocks. Each data block has the same format as follows:

First line = Number of lines of data block
Second line = ID of data block
From third line = Actual data lines (with number of lines designated in the first line)

For example, looking at the first data block:

2 (=> This data block has 2 data lines)
1 (=> This data block's ID is 1, i.e. the very first data block)
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
(The third and fourth lines of the data block are the actual data.
 The integer on the first line, the number of lines for this data block, is 2, so the actual data consists of 2 lines.
 All other data blocks follow the same format.)
  1. The first two lines are headers of each repeating data block. The first line designates the number of lines of the data block, and the second line is the ID of the data block.
  2. Then the real data follows from the third line of each data block. As I explained, the number of lines of "real data" is designated in the first line as an integer number.
  3. Accordingly, the total number of lines for each data block will be number_of_lines+2. In this example, the total number of lines for data block 1 is 4, data block 2 takes 6 lines, and so on.

This format repeats 100,000 times. In other words, I have 100,000 data blocks just like this example in a single text file. I can provide the total number of data blocks if needed.
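
Just to make the block layout concrete, here is a small illustrative sketch (it is not my sorting attempt, and data.txt just stands for my file): it only shows how the count on each block's first line tells you how many data lines follow.

while read -r count && read -r id; do
    echo "block $id has $count data lines"
    for ((i = 0; i < count; i++)); do
        read -r line                    # one data line of the current block
    done
done < data.txt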

What I wish to do is sort the actual data lines of each data block by the fourth column (largest value first, as in the example below) and print the data in the same format as the original file. For the example above, the result I wish to achieve looks like:

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
B    36.459000   29.662701   17.806299         5
5
3
B     1.326100   17.127600   39.600300         4
B    35.853001   43.095901   17.402901         4
B    31.686800   44.997501   16.619499         4
A     9.837760   41.103199   13.062300         5
A    17.501801   44.279202    8.005670         5
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
B    31.437799   30.415001   37.454399         4
A    39.311199   27.371201   35.473999         5
B    30.145800   13.256200   30.713800         4
A    42.586601   21.343800   23.280701         5
B    48.501999    1.277730   21.900600         4
B    29.186001   44.353699    9.057160         4
...

I know the typical Linux command to sort data by the 4th column, which is:

sort -gk4 data.txt > data_4sort.txt

However, in this case I need to perform the sorting per data block, and the blocks do not have a uniform size; the number of data lines varies from block to block.

I really have no idea how to even approach this. I'm considering a shell script with a for loop and/or awk, sed, split, etc., combined with sort, but I'm not sure how a for loop would apply to this problem, or whether a for loop and shell script are really required to perform this block-wise sorting.

Upvotes: 4

Views: 187

Answers (4)

markp-fuso

Reputation: 34808

I know, old question, but here's another idea that's a bit more performant with newer gawk abilities: use gawk and its predefined array scanning orders (ie, eliminate the need for an external sort) ...

awk '
BEGIN { delete arr }                            # hack to pre-declare variable arr as an array

function sort_block() {

    if ( length(arr) > 0 ) {                    # if array is not empty then ...
       PROCINFO["sorted_in"]="@ind_num_desc"    # desginate sorting by index/descending and then ...
       for (i in arr)                           # loop through array ...
           print arr[i]                         # printing array entries
    }
    delete arr                                  # clear array
}

NF==1 { sort_block()                            # header line: flush/print the previous block ...
        print                                   # ... then print the header line itself
      }
NF>1  { arr[$4]=$0                              # data line: save it, keyed by its 4th column
        next
      }
END   { sort_block() }                          # flush anything still in array
' block.dat

Running this against the OP's sample data set generates:

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
B    36.459000   29.662701   17.806299         5
5
3
B     1.326100   17.127600   39.600300         4
B    35.853001   43.095901   17.402901         4
B    31.686800   44.997501   16.619499         4
A     9.837760   41.103199   13.062300         5
A    17.501801   44.279202    8.005670         5
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
B    31.437799   30.415001   37.454399         4
A    39.311199   27.371201   35.473999         5
B    30.145800   13.256200   30.713800         4
A    42.586601   21.343800   23.280701         5
B    48.501999    1.277730   21.900600         4
B    29.186001   44.353699    9.057160         4

Repeated runs of this code in a cygwin environment within a VM on a laptop-grade i7 (ie, this is by no means a fast system) provide the following best timing:

real    0m0.026s
user    0m0.000s
sys     0m0.015s
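
For anyone unfamiliar with gawk's array scanning orders, here's a minimal standalone illustration of the "@ind_num_desc" mechanism (the array contents are made up):

gawk 'BEGIN {
    a[2] = "two"; a[10] = "ten"; a[1] = "one"
    PROCINFO["sorted_in"] = "@ind_num_desc"     # visit indices as numbers, largest first
    for (i in a)                                # iterates 10, 2, 1
        print i, a[i]
}'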

Upvotes: 0

Ed Morton

Reputation: 203807

Use awk to prepare the text for sorting and then pipe it to a single call to sort, with no subshells being spawned and no temporary files created:

$ cat tst.sh
#!/usr/bin/env bash

awk '
BEGIN { OFS="\t"; recNr=1 }
NF < pNF { ++recNr }
{
    print recNr, NF, NR, $0
    pNF = NF
}
' "${@:--}" |
sort -k1,1n -k2,2n -k7,7nr -k3,3n |
cut -f4-

$ ./tst.sh file
2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
B    36.459000   29.662701   17.806299         5
5
3
B     1.326100   17.127600   39.600300         4
B    35.853001   43.095901   17.402901         4
B    31.686800   44.997501   16.619499         4
A     9.837760   41.103199   13.062300         5
A    17.501801   44.279202    8.005670         5
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
B    31.437799   30.415001   37.454399         4
A    39.311199   27.371201   35.473999         5
B    30.145800   13.256200   30.713800         4
A    42.586601   21.343800   23.280701         5
B    48.501999    1.277730   21.900600         4
B    29.186001   44.353699    9.057160         4
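
To see what sort is operating on: for the 2nd block of the sample input, the decorated lines coming out of the awk stage should look roughly like this (a tab-separated prefix of block number, NF and NR in front of each original line; sort orders on those plus the original 4th data column, and cut -f4- then strips the prefix off again):

2   1   5   4
2   1   6   2
2   5   7   A    26.274099   38.533298   32.624298         4
2   5   8   B    36.459000   29.662701   17.806299         5
2   5   9   A    15.850300   28.130899   24.753901         4
2   5   10  A    32.098000   33.665699   20.372499         4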

Here's the 3rd-run timing difference in execution speed for the OP's input file between the above:

$ time ./tst.sh file >/dev/null

real    0m0.105s
user    0m0.030s
sys     0m0.091s

and the current awk answers that spawn a subshell per block of input to be sorted:

@IceCreamToucan's answer:

$ cat tst2.sh
#!/usr/bin/env bash

awk '{
  if(NF > 1){
    print | "sort -k4,4nr"
  }
  else{
    close("sort -k4,4nr")
    print
  }
}' "${@:--}"

$ time ./tst2.sh file >/dev/null

real    0m0.405s
user    0m0.120s
sys     0m0.121s

@paxdiablo's answer:

$ cat tst3.sh
#!/usr/bin/env bash

awk '
    function sort() {
        close(tmp_file)
        system("sort -rnk4 "tmp_file"; rm -rf "tmp_file)
    }
    BEGIN { tmp_file = "/tmp/tmp_file" }
    state == 0 && NF == 1 { print ; next }
    state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
    state == 1 && NF != 1 { print >>tmp_file ; next }
    state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
    END { if (state == 1) { sort() } }
' "${@:--}"

$ time ./tst3.sh file >/dev/null

real    0m0.804s
user    0m0.060s
sys     0m0.396s

Upvotes: 4

IceCreamToucan

Reputation: 28695

One option is to pipe to a command, as below, without explicitly creating a temporary file.

awk '
BEGIN {
  cmd = "sort -k4,4nr"
}
{
  if(NF > 1){
    print | cmd
  }
  else{
    close(cmd)
    print
  }
}' input_file

Note: Changed to numeric sorting and moved cmd to a variable thanks to helpful tips from @Ed Morton

Output:

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
B    36.459000   29.662701   17.806299         5
5
3
B     1.326100   17.127600   39.600300         4
B    35.853001   43.095901   17.402901         4
B    31.686800   44.997501   16.619499         4
A     9.837760   41.103199   13.062300         5
A    17.501801   44.279202    8.005670         5
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
B    31.437799   30.415001   37.454399         4
A    39.311199   27.371201   35.473999         5
B    30.145800   13.256200   30.713800         4
A    42.586601   21.343800   23.280701         5
B    48.501999    1.277730   21.900600         4
B    29.186001   44.353699    9.057160         4

Upvotes: 2

paxdiablo

Reputation: 881783

This can be done using a combination of awk and sort. You use awk to drive the whole process as a very simple state machine, and it follows these rules:

  • There are two states, indicating which line-type group you're processing.
  • The line types are short (state 0, one field) and long (state 1, more fields).
  • Different actions occur depending on current state and current line being processed.
  • Items needing sorting will be temporarily written to a sort file and sorted when needed.

The actions are basically, for each line:

  • In short state:
    • For a short line, just print it.
    • Otherwise (a long line), create new sort file with just that line, set state to long.
  • Otherwise (in long state):
    • If long line, just append it to sort file.
    • Otherwise (a short line), sort file, print short line, set state to short.

At the end, sort/print/delete the sort file if needed (since there won't be a final transition from long line to short line after the final block, this action takes care of that).

That processing is implemented as per below, a shell script that calls awk to do the work:

awk '
    function sort() {
        close(tmp_file)
        system("sort -rnk4 "tmp_file"; rm -rf "tmp_file)
    }
    BEGIN { tmp_file = "/tmp/tmp_file" }
    state == 0 && NF == 1 { print ; next }
    state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
    state == 1 && NF != 1 { print >>tmp_file ; next }
    state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
    END { if (state == 1) { sort() } }
' input_file
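
If you run several of these at the same time, one possible tweak (an untested sketch along the same lines) is to let mktemp pick the temporary file name instead of hard-coding /tmp/tmp_file, so concurrent runs can't clobber each other's sort file:

tmp_file=$(mktemp) || exit 1        # unique name instead of a fixed /tmp/tmp_file
awk -v tmp_file="$tmp_file" '
    function sort() {
        close(tmp_file)
        system("sort -rnk4 "tmp_file"; rm -f "tmp_file)
    }
    state == 0 && NF == 1 { print ; next }
    state == 0 && NF != 1 { print >tmp_file ; state = 1 ; next }
    state == 1 && NF != 1 { print >>tmp_file ; next }
    state == 1 && NF == 1 { sort() ; print ; state = 0 ; next }
    END { if (state == 1) { sort() } }
' input_file
rm -f "$tmp_file"                   # clean up if no block was left to sort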

You can also do it without a temporary file with an advanced enough awk variant, one that allows for pipes:

awk 'NF > 1 { print | "sort -rnk4"; next }
            { close("sort -rnk4"); print }
' input_file

The advantage of this is a much simpler and shorter awk program that does most of the stuff in the previous script implicitly.

Running either of those scripts over your input file generates, as requested, each separate grouping sorted on the fourth column (the final float):

2
1
A    11.364500   48.395199   40.160500         5
B    35.067699   30.944700   24.155300         4
4
2
A    26.274099   38.533298   32.624298         4
A    15.850300   28.130899   24.753901         4
A    32.098000   33.665699   20.372499         4
B    36.459000   29.662701   17.806299         5
5
3
B     1.326100   17.127600   39.600300         4
B    35.853001   43.095901   17.402901         4
B    31.686800   44.997501   16.619499         4
A     9.837760   41.103199   13.062300         5
A    17.501801   44.279202    8.005670         5
3
4
B    31.274700    8.726580   25.267599         4
A    19.032400   41.384701   19.456301         5
B    19.441900   24.286400    6.961680         4
1
5
B    48.973999   15.508400    5.099710         4
6
6
B    31.437799   30.415001   37.454399         4
A    39.311199   27.371201   35.473999         5
B    30.145800   13.256200   30.713800         4
A    42.586601   21.343800   23.280701         5
B    48.501999    1.277730   21.900600         4
B    29.186001   44.353699    9.057160         4

Upvotes: 5
