Amz
Amz

Reputation: 133

How to sort or rearrange numbers from multiple column into multiple row [fixed into 4 columns]?

I have 1 text file, which is test1.txt.

text1.txt contain as following:
Input:

##[A1] [B1] [T1]  [V1] [T2]  [V2] [T3]  [V3] [T4]  [V4]## --> headers
    1  1000    0   100   10   200   20   300   30   400
              40   500   50   600   60   700   70   800
       1010    0   101   10   201   20   301   30   401
              40   501   50   601  
    2  1000    0   110   15   210   25   310   35   410
              45   510   55   610   65   710
       1010    0   150   10   250   20   350   30   450
              40   550  

Condition:
A1 and B1 -> for each A1 + (B1 + [Tn + Vn])
A1 should be in 1 column.
B1 should be in 1 column.
T1,T2,T3 and T4 should be in 1 column.
V1,V2,V3 and V4 should be in 1 column.

How do I sort it become like below?
Desire Output:

##   A1    B1   Tn    Vn ## --> headers

      1  1000    0   100
                10   200
                20   300
                30   400
                40   500
                50   600
                60   700
                70   800
         1010    0   101
                10   201
                20   301
                30   401
                40   501
                50   601
      2  1000    0   110
                15   210
                25   310
                35   410
                45   510
                55   610
                65   710
         1010    0   150
                10   250
                20   350
                30   450
                40   550

Here is my current code:
First Attempt:
Input

cat test1.txt | awk ' { a=$1 b=$2 } { for(i=1; i<=5; i=i+1) { t=substr($0,11+i*10,5) v=substr($0,16+i*10,5) if( t ~ /^\ +[0-9]+$/ || t ~ /^[0-9]+$/ || t ~ /^\ +[0-9]+\ +$/ ){ printf "%7s %7d %8d %8d \n",a,b,t,v } }}' | less

Output:

      1    1000      400        0 
     40     500      800        0 
   1010       0      401        0 
      2    1000      410        0 
   1010       0      450        0

I'm trying using simple awk command, but still can't get the result.
Can anyone help me on this?

Thanks,
Am

Upvotes: 2

Views: 345

Answers (5)

user3408541
user3408541

Reputation: 67

Here is a Perl solution! I made a couple assumptions which I will list here.

  1. All fields are 4 characters separated by a space, except the v fields which are 5
  2. All the fields end directly under the ] except the A1 field. I assumed that was a copy/paste error and put it in line with the ]. I also removed the ## and leading whitespace as I assumed this was not in the original file.
  3. If there is a tn value, there is a corresponding vn value
  4. a1 and b1 can be empty, but there must be SOME data, then the trailing fields can be empty.
  5. t1 and v1 are the only columns that must always be there

Here is the input file with the changes listed above

$ more table.reformat.txt
[A1] [B1] [T1]  [V1] [T2]  [V2] [T3]  [V3] [T4]  [V4]
   1 1000    0   100   10   200   20   300   30   400
            40   500   50   600   60   700   70   800
     1010    0   101   10   201   20   301   30   401
            40   501   50   601  
   2 1000    0   110   15   210   25   310   35   410
            45   510   55   610   65   710
     1010    0   150   10   250   20   350   30   450
            40   550

Because some data can be there and some data can be missing, I split it into 3 regular expressions to handle each case. I nearly had it working with one regular expression but the shortened rows with possibly trailing whitespace required special attention. A csv file would have been a lot easier. Instead I had to check for varying amounts of whitespace as a separator, as well as possibly missing fields in the front as well as the back.

Here is the code

#!/usr/bin/perl -w

my $reFullLine   =qr"^([\w\W]{4}) ([\w\W]{4}) ([\w\W]{4}) ([\w\W]{5}) ([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?";
my $reOneMissing =qr"^ {5}([\w\W]{4}) ([\w\W]{4}) ([\w\W]{5}) ([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?";
my $reBothMissing=qr"^ {10}([\w\W]{4}) ([\w\W]{5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?";

while(<>){
  my ($a1,$b1,@t,@v);
  chomp;
  s/ +$//;              #remove trailing whitespace, this matters for columns with incomplete data
  if(/\[/){              #header line, display new headers and go to next line
    print "[A1] [B1] [Tn]  [Vn]\n";
    next;
  }
  if(!&a1IsEmpty($_) && !&b1IsEmpty($_)){     #a1 and b1 are there
    /$reFullLine/ or die("Incorrect data formatting");
     ($a1,$b1,$t[0],$v[0],$t[1],$v[1],$t[2],$v[2],$t[3],$v[3])=("$1","$2","$3","$4","$5","$6","$7","$8","$9","$10");
  } elsif( &a1IsEmpty($_) && !b1IsEmpty($_) ){ #a1 is empty and b1 is not, skip where a1 is (5 spaces) and start at b1
    /$reOneMissing/ or die("Incorrect data formatting");
    ($a1,$b1,$t[0],$v[0],$t[1],$v[1],$t[2],$v[2],$t[3],$v[3])=("    ","$1","$2","$3","$4","$5","$6","$7","$8","$9");
  } else{                                       #both a1 and b1 are empty, skip 10 spaces and start with t1
    /$reBothMissing/ or die("Incorrect data formatting");
    ($a1,$b1,$t[0],$v[0],$t[1],$v[1],$t[2],$v[2],$t[3],$v[3])=("    ","    ","$1","$2","$3","$4","$5","$6","$7","$8");
  }
  # Debug: print "a1:\"$a1\" b1:\"$b1\" t1:\"$t[0]\" v1:\"$v[0]\" t2:\"$t[1]\" v2:\"$v[1]\" t3:\"$t[2]\" v3:\"$v[2]\" t4:\"$t[3]\" v4:\"$v[3]\"\n";
  print "$a1 $b1 $t[0] $v[0]\n";
  #The number of spaces are important for formatting.  a1 and b1 are both 4 spaces, with a single space as a separator
  if($t[1]){print "          $t[1] $v[1]\n";}
  if($t[2]){print "          $t[2] $v[2]\n";}
  if($t[3]){print "          $t[3] $v[3]\n";}
}

sub a1IsEmpty{
  /^ {5}/;
}

sub b1IsEmpty{
  /^ {10}/;
}

Here is the output

$ perl table.reformat.pl table.reformat.txt 
[A1] [B1] [Tn]  [Vn]
   1 1000    0   100
            10   200
            20   300
            30   400
            40   500
            50   600
            60   700
            70   800
     1010    0   101
            10   201
            20   301
            30   401
            40   501
            50   601
   2 1000    0   110
            15   210
            25   310
            35   410
            45   510
            55   610
            65   710
     1010    0   150
            10   250
            20   350
            30   450
            40   550

Upvotes: -1

Ed Morton
Ed Morton

Reputation: 203899

Unlike what is stated elsewhere, there's nothing tricky about this at all, you're just using fixed width fields in your input instead of char/string separated fields.

With GNU awk for FIELDWIDTHS to handle fixed width fields it really couldn't be much simpler:

$ cat tst.awk
BEGIN {
    # define the width of the input and output fields
    FIELDWIDTHS = "2 4 5 5 6 5 6 5 6 5 6 99"
    ofmt = "%2s%5s%6s%5s%6s%s\n"
}
{
    # strip leading/trailing blanks and square brackets from every field
    for (i=1; i<=NF; i++) {
         gsub(/^[[\s]+|[]\s]+$/,"",$i)
    }
}
NR==1 {
    # print the header line
    printf ofmt, $1, $2, $3, "Tn", "Vn", " "$NF
    next
}
{
    # print every other line
    for (i=4; i<NF; i+=2) {
        printf ofmt, $1, $2, $3, $i, $(i+1), ""
        $1 = $2 = $3 = ""
    }
}

.

$ awk -f tst.awk file
##   A1    B1   Tn    Vn ## --> headers
      1  1000    0   100
                10   200
                20   300
                30   400
                40   500
                50   600
                60   700
                70   800
         1010    0   101
                10   201
                20   301
                30   401
                40   501
                50   601
      2  1000    0   110
                15   210
                25   310
                35   410
                45   510
                55   610
                65   710
         1010    0   150
                10   250
                20   350
                30   450
                40   550

With other awks you'd use a while() { substr() } loop instead of FIELDWIDTHS so it'd be a couple more lines of code but still trivial.

The above will be orders of magnitude faster than an equivalent shell script. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

Upvotes: 4

potong
potong

Reputation: 58473

This might work for you (GNU sed):

sed -r '1d;s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1     /;s/^(.{5}(.{6}).*\n.{5})\2/\1      /;/\S/P;D' file

Delete the first line (if the header is needed see below). The key fields occupy the first 11 (the first key is 5 characters and the second 6) characters and the data fields occupy the next 11. Insert a newline and the key fields before each pair of data fields. Compare the keys on adjacent lines and replace by spaces if they are duplicated. Do not print any blank lines.

If the header is needed, use the following:

sed -r '1{s/\[[^]]+\]\s*//5g;y/[]/  /;s/1/n/3g;s/B/ B/;G;b};s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1     /;s/^(.{5}(.{6}).*\n.{5})\2/\1      /;/\S/P;D' file

This does additional formatting on the first line to remove superfluous headings, []'s, replace 1's by n, add an additional space for alignment and a following empty line.

Further more. By utilising the second line of the input file as a template for the data, a sed script can be created that does not have any hard coded values:

sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#1d;s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -f - file

The script created from the template is piped into a second invocation of the sed as a file and run against the original file to produce the required output.

Likewise the headers may be formatted if need be as so:

sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -e '1{s/\[[^]]+\]\s*//5g;y/[]/  /;s/1/n/3g;s/B/ B/;G;b}' -f - file

By extracting the first four fields from the second line of the input file, Four variables can be made. Two regexp and two values. These variables can be used to build the sed script.

N.B. The sed script is created from strings extracted from the template and the variables produced are also strings so they can be concatenated to produce further new regexp's and new values etc etc

Upvotes: 1

David C. Rankin
David C. Rankin

Reputation: 84569

This is a rather tricky problem that can be handled a number of ways. Whether bash, perl or awk, you will need to handle to number of fields in some semi-generic way instead of just hardcoding values for your example.

Using bash, so long as you can rely on an even-number of fields in all lines (except for the lines with the sole initial value (e.g. 1010), you can accommodate the number of fields is a reasonably generic way. For the lines with 1, 2, etc.. you know your initial output will contain 4-fields. For lines with 1010, etc.. you know the output will contain an initial 3-fields. For the remaining values you are simply outputting pairs.

The tricky part is handling the alignment. Here is where printf which allows you to set the field-width with a parameter using the form "%*s" where the conversion specifier expects the next parameter to be an integer value specifying the field-width followed by a parameter for the string conversion itself. It takes a little gymnastics, but you could do something like the following in bash itself:

(note: edit to match your output header format)

#!/bin/bash

declare -i nfields wd=6     ## total no. fields, printf field-width modifier

while read -r line; do      ## read each line  (preserve for header line)
    arr=($line)             ## separate into array
    first=${arr[0]}         ## check for '#' in first line for header
    if [ "${first:0:1}" = '#' ]; then
        nfields=$((${#arr[@]} - 2))     ## no. fields in header
        printf "##   A1    B1   Tn    Vn ## --> headers\n"  ## new header
        continue
    fi
    fields=${#arr[@]}                   ## fields in line
    case "$fields" in
        $nfields )                      ## fields -eq nfiles?
            cnt=4                       ## handle 1st 4 values in line
            printf " "
            for ((i=0; i < cnt; i++)); do
                if [ "$i" -eq '2' ]; then
                    printf "%*s" "5" "${arr[i]}"
                else
                    printf "%*s" "$wd" "${arr[i]}"
                fi
            done
            echo
            for ((i = cnt; i < $fields; i += 2)); do    ## handle rest
                printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
            done
            ;;
        $((nfields - 1)) )              ## one less than nfields
            cnt=3                       ## handle 1st 3 values
            printf " %*s%*s" "$wd" " "
            for ((i=0; i < cnt; i++)); do
                if [ "$i" -eq '1' ]; then
                    printf "%*s" "5" "${arr[i]}"
                else
                    printf "%*s" "$wd" "${arr[i]}"
                fi
            done
            echo
            for ((i = cnt; i < $fields; i += 2)); do    ## handle rest
                if [ "$i" -eq '0' ]; then
                    printf "%*s%*s%*s\n" "$((wd+1))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
                else
                    printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
                fi
            done
            ;;
        * )     ## all other lines format as pairs
            for ((i = 0; i < $fields; i += 2)); do
                printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
            done
            ;;
    esac
done

Rather than reading from a file, just use redirection to redirect the input file to your script (if you want to just provide a filename, then redirect the file to feed the output while read... loop)

Example Use/Output

$ bash text1format.sh <dat/text1.txt
##   A1    B1   Tn    Vn ## --> headers
      1  1000    0   100
                10   200
                20   300
                30   400
                40   500
                50   600
                60   700
                70   800
         1010    0   101
                10   201
                20   301
                30   401
                40   501
                50   601
      2  1000    0   110
                15   210
                25   310
                35   410
                45   510
                55   610
                65   710
         1010    0   150
                10   250
                20   350
                30   450
                40   550

As between awk and bash, awk will generally be faster, but here with formatted output, it may be closer than usual. Look things over and let me know if you have questions.

Upvotes: 0

Jonathan Leffler
Jonathan Leffler

Reputation: 754400

This isn't easy because it is hard to identify when you have the different styles of row — those with values in both column 1 and column 2, those with no value in column 1 and a value in column 2, and those no value in column 1 or 2. A first step is to make this easier — sed to the rescue:

$ sed 's/[[:space:]]\{1,\}$//
s/^....../&|/
s/|....../&|/
:a
s/|\(  *[0-9][0-9]* \)\( *[^|]\)/|\1|\2/
t a' data
    1 | 1000 |   0 |  100 |  10 |  200 |  20 |  300 |  30 |  400
      |      |  40 |  500 |  50 |  600 |  60 |  700 |  70 |  800
      | 1010 |   0 |  101 |  10 |  201 |  20 |  301 |  30 |  401
      |      |  40 |  501 |  50 |  601
    2 | 1000 |   0 |  110 |  15 |  210 |  25 |  310 |  35 |  410
      |      |  45 |  510 |  55 |  610 |  65 |  710
      | 1010 |   0 |  150 |  10 |  250 |  20 |  350 |  30 |  450
      |      |  40 |  550
$

The first line removes any trailing white space, to avoid confusion. The next two expressions handle the fixed-width columns 1 and 2 (6 characters each). The next line creates a label a; the substitute finds a pipe |, some spaces, some digits, a space, and some trailing material which doesn't include a pipe; and inserts a pipe in the middle. The t a jumps back to the label if a substitution was done.

With that in place, it becomes easy to manage awk with a field separator of |. This is verbose, but seems to do the trick:

awk -F '|' '
$1 > 0 { printf "%5d  %4d  %3d  %3d\n", $1, $2, $3, $4
         for (i = 5; i <= NF; i += 2) { printf "%5s  %4s  %3d  %3d\n", "", "", $i, $(i+1) }
         next
       }
$2 > 0 { printf "%5s  %4d  %3d  %3d\n", "", $2, $3, $4
         for (i = 5; i <= NF; i += 2) { printf "%5s  %4s  %3d  %3d\n", "", "", $i, $(i+1) }
         next
       }
       { for (i = 3; i <= NF; i += 2) { printf "%5s  %4s  %3d  %3d\n", "", "", $i, $(i+1) }
         next
       }'

Output:

    1  1000    0  100
              10  200
              20  300
              30  400
              40  500
              50  600
              60  700
              70  800
       1010    0  101
              10  201
              20  301
              30  401
              40  501
              50  601
    2  1000    0  110
              15  210
              25  310
              35  410
              45  510
              55  610
              65  710
       1010    0  150
              10  250
              20  350
              30  450
              40  550

If you need to remove the headings, add 1d; to the start of the sed script.

Upvotes: 3

Related Questions