Reputation: 133
I have 1 text file, which is test1.txt.
text1.txt contain as following:
Input:
##[A1] [B1] [T1] [V1] [T2] [V2] [T3] [V3] [T4] [V4]## --> headers
1 1000 0 100 10 200 20 300 30 400
40 500 50 600 60 700 70 800
1010 0 101 10 201 20 301 30 401
40 501 50 601
2 1000 0 110 15 210 25 310 35 410
45 510 55 610 65 710
1010 0 150 10 250 20 350 30 450
40 550
Condition:
A1 and B1 -> for each A1 + (B1 + [Tn + Vn])
A1 should be in 1 column.
B1 should be in 1 column.
T1,T2,T3 and T4 should be in 1 column.
V1,V2,V3 and V4 should be in 1 column.
How do I sort it become like below?
Desire Output:
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
Here is my current code:
First Attempt:
Input
cat test1.txt | awk ' { a=$1 b=$2 } { for(i=1; i<=5; i=i+1) { t=substr($0,11+i*10,5) v=substr($0,16+i*10,5) if( t ~ /^\ +[0-9]+$/ || t ~ /^[0-9]+$/ || t ~ /^\ +[0-9]+\ +$/ ){ printf "%7s %7d %8d %8d \n",a,b,t,v } }}' | less
Output:
1 1000 400 0
40 500 800 0
1010 0 401 0
2 1000 410 0
1010 0 450 0
I'm trying using simple awk command, but still can't get the result.
Can anyone help me on this?
Thanks,
Am
Upvotes: 2
Views: 345
Reputation: 67
Here is a Perl solution! I made a couple assumptions which I will list here.
]
except the A1 field. I assumed that was a copy/paste error and put it in line with the ]
. I also removed the ## and leading whitespace as I assumed this was not in the original file.Here is the input file with the changes listed above
$ more table.reformat.txt
[A1] [B1] [T1] [V1] [T2] [V2] [T3] [V3] [T4] [V4]
1 1000 0 100 10 200 20 300 30 400
40 500 50 600 60 700 70 800
1010 0 101 10 201 20 301 30 401
40 501 50 601
2 1000 0 110 15 210 25 310 35 410
45 510 55 610 65 710
1010 0 150 10 250 20 350 30 450
40 550
Because some data can be there and some data can be missing, I split it into 3 regular expressions to handle each case. I nearly had it working with one regular expression but the shortened rows with possibly trailing whitespace required special attention. A csv file would have been a lot easier. Instead I had to check for varying amounts of whitespace as a separator, as well as possibly missing fields in the front as well as the back.
Here is the code
#!/usr/bin/perl -w
my $reFullLine =qr"^([\w\W]{4}) ([\w\W]{4}) ([\w\W]{4}) ([\w\W]{5}) ([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?";
my $reOneMissing =qr"^ {5}([\w\W]{4}) ([\w\W]{4}) ([\w\W]{5}) ([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?";
my $reBothMissing=qr"^ {10}([\w\W]{4}) ([\w\W]{5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?([\w\W]{0,4}) ?([\w\W]{0,5}) ?";
while(<>){
my ($a1,$b1,@t,@v);
chomp;
s/ +$//; #remove trailing whitespace, this matters for columns with incomplete data
if(/\[/){ #header line, display new headers and go to next line
print "[A1] [B1] [Tn] [Vn]\n";
next;
}
if(!&a1IsEmpty($_) && !&b1IsEmpty($_)){ #a1 and b1 are there
/$reFullLine/ or die("Incorrect data formatting");
($a1,$b1,$t[0],$v[0],$t[1],$v[1],$t[2],$v[2],$t[3],$v[3])=("$1","$2","$3","$4","$5","$6","$7","$8","$9","$10");
} elsif( &a1IsEmpty($_) && !b1IsEmpty($_) ){ #a1 is empty and b1 is not, skip where a1 is (5 spaces) and start at b1
/$reOneMissing/ or die("Incorrect data formatting");
($a1,$b1,$t[0],$v[0],$t[1],$v[1],$t[2],$v[2],$t[3],$v[3])=(" ","$1","$2","$3","$4","$5","$6","$7","$8","$9");
} else{ #both a1 and b1 are empty, skip 10 spaces and start with t1
/$reBothMissing/ or die("Incorrect data formatting");
($a1,$b1,$t[0],$v[0],$t[1],$v[1],$t[2],$v[2],$t[3],$v[3])=(" "," ","$1","$2","$3","$4","$5","$6","$7","$8");
}
# Debug: print "a1:\"$a1\" b1:\"$b1\" t1:\"$t[0]\" v1:\"$v[0]\" t2:\"$t[1]\" v2:\"$v[1]\" t3:\"$t[2]\" v3:\"$v[2]\" t4:\"$t[3]\" v4:\"$v[3]\"\n";
print "$a1 $b1 $t[0] $v[0]\n";
#The number of spaces are important for formatting. a1 and b1 are both 4 spaces, with a single space as a separator
if($t[1]){print " $t[1] $v[1]\n";}
if($t[2]){print " $t[2] $v[2]\n";}
if($t[3]){print " $t[3] $v[3]\n";}
}
sub a1IsEmpty{
/^ {5}/;
}
sub b1IsEmpty{
/^ {10}/;
}
Here is the output
$ perl table.reformat.pl table.reformat.txt
[A1] [B1] [Tn] [Vn]
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
Upvotes: -1
Reputation: 203899
Unlike what is stated elsewhere, there's nothing tricky about this at all, you're just using fixed width fields in your input instead of char/string separated fields.
With GNU awk for FIELDWIDTHS to handle fixed width fields it really couldn't be much simpler:
$ cat tst.awk
BEGIN {
# define the width of the input and output fields
FIELDWIDTHS = "2 4 5 5 6 5 6 5 6 5 6 99"
ofmt = "%2s%5s%6s%5s%6s%s\n"
}
{
# strip leading/trailing blanks and square brackets from every field
for (i=1; i<=NF; i++) {
gsub(/^[[\s]+|[]\s]+$/,"",$i)
}
}
NR==1 {
# print the header line
printf ofmt, $1, $2, $3, "Tn", "Vn", " "$NF
next
}
{
# print every other line
for (i=4; i<NF; i+=2) {
printf ofmt, $1, $2, $3, $i, $(i+1), ""
$1 = $2 = $3 = ""
}
}
.
$ awk -f tst.awk file
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
With other awks you'd use a while() { substr() }
loop instead of FIELDWIDTHS
so it'd be a couple more lines of code but still trivial.
The above will be orders of magnitude faster than an equivalent shell script. See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
Upvotes: 4
Reputation: 58473
This might work for you (GNU sed):
sed -r '1d;s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1 /;s/^(.{5}(.{6}).*\n.{5})\2/\1 /;/\S/P;D' file
Delete the first line (if the header is needed see below). The key fields occupy the first 11 (the first key is 5 characters and the second 6) characters and the data fields occupy the next 11. Insert a newline and the key fields before each pair of data fields. Compare the keys on adjacent lines and replace by spaces if they are duplicated. Do not print any blank lines.
If the header is needed, use the following:
sed -r '1{s/\[[^]]+\]\s*//5g;y/[]/ /;s/1/n/3g;s/B/ B/;G;b};s/^(.{11}).{11}/&\n\1/;s/^((.{5}).*\n)\2/\1 /;s/^(.{5}(.{6}).*\n.{5})\2/\1 /;/\S/P;D' file
This does additional formatting on the first line to remove superfluous headings, []
's, replace 1
's by n
, add an additional space for alignment and a following empty line.
Further more. By utilising the second line of the input file as a template for the data, a sed script can be created that does not have any hard coded
values:
sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#1d;s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -f - file
The script created from the template is piped into a second invocation of the sed as a file and run against the original file to produce the required output.
Likewise the headers may be formatted if need be as so:
sed -r '2!d;s/\s*\S*//3g;s/.\>/&\n/;h;s/[^\n]/./g;G;s/[^\n.]/ /g;s#(.*)\n(.*)\n(.*)\n(.*)#s/^(\1\2)\1\2/\&\\n\\1/;s/^((\1).*\\n)\\2/\\1\3/;s/^(\1(\2).*\\n\1)\\2/\\1\4/;/\\S/P;D#' file |
sed -r -e '1{s/\[[^]]+\]\s*//5g;y/[]/ /;s/1/n/3g;s/B/ B/;G;b}' -f - file
By extracting the first four fields from the second line of the input file, Four variables can be made. Two regexp and two values. These variables can be used to build the sed script.
N.B. The sed script is created from strings extracted from the template and the variables produced are also strings so they can be concatenated to produce further new regexp's and new values etc etc
Upvotes: 1
Reputation: 84569
This is a rather tricky problem that can be handled a number of ways. Whether bash
, perl
or awk
, you will need to handle to number of fields in some semi-generic way instead of just hardcoding values for your example.
Using bash, so long as you can rely on an even-number of fields in all lines (except for the lines with the sole initial value (e.g. 1010
), you can accommodate the number of fields is a reasonably generic way. For the lines with 1, 2
, etc.. you know your initial output will contain 4-fields
. For lines with 1010
, etc.. you know the output will contain an initial 3-fields
. For the remaining values you are simply outputting pairs.
The tricky part is handling the alignment. Here is where printf
which allows you to set the field-width with a parameter using the form "%*s"
where the conversion specifier expects the next parameter to be an integer
value specifying the field-width followed by a parameter for the string conversion itself. It takes a little gymnastics, but you could do something like the following in bash itself:
(note: edit to match your output header format)
#!/bin/bash
declare -i nfields wd=6 ## total no. fields, printf field-width modifier
while read -r line; do ## read each line (preserve for header line)
arr=($line) ## separate into array
first=${arr[0]} ## check for '#' in first line for header
if [ "${first:0:1}" = '#' ]; then
nfields=$((${#arr[@]} - 2)) ## no. fields in header
printf "## A1 B1 Tn Vn ## --> headers\n" ## new header
continue
fi
fields=${#arr[@]} ## fields in line
case "$fields" in
$nfields ) ## fields -eq nfiles?
cnt=4 ## handle 1st 4 values in line
printf " "
for ((i=0; i < cnt; i++)); do
if [ "$i" -eq '2' ]; then
printf "%*s" "5" "${arr[i]}"
else
printf "%*s" "$wd" "${arr[i]}"
fi
done
echo
for ((i = cnt; i < $fields; i += 2)); do ## handle rest
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
done
;;
$((nfields - 1)) ) ## one less than nfields
cnt=3 ## handle 1st 3 values
printf " %*s%*s" "$wd" " "
for ((i=0; i < cnt; i++)); do
if [ "$i" -eq '1' ]; then
printf "%*s" "5" "${arr[i]}"
else
printf "%*s" "$wd" "${arr[i]}"
fi
done
echo
for ((i = cnt; i < $fields; i += 2)); do ## handle rest
if [ "$i" -eq '0' ]; then
printf "%*s%*s%*s\n" "$((wd+1))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
else
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
fi
done
;;
* ) ## all other lines format as pairs
for ((i = 0; i < $fields; i += 2)); do
printf "%*s%*s%*s\n" "$((2*wd))" " " "$wd" "${arr[i]}" "$wd" "${arr[$((i+1))]}"
done
;;
esac
done
Rather than reading from a file, just use redirection to redirect the input file to your script (if you want to just provide a filename, then redirect the file to feed the output while read...
loop)
Example Use/Output
$ bash text1format.sh <dat/text1.txt
## A1 B1 Tn Vn ## --> headers
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
As between awk
and bash
, awk
will generally be faster, but here with formatted output, it may be closer than usual. Look things over and let me know if you have questions.
Upvotes: 0
Reputation: 754400
This isn't easy because it is hard to identify when you have the different styles of row — those with values in both column 1 and column 2, those with no value in column 1 and a value in column 2, and those no value in column 1 or 2. A first step is to make this easier — sed
to the rescue:
$ sed 's/[[:space:]]\{1,\}$//
s/^....../&|/
s/|....../&|/
:a
s/|\( *[0-9][0-9]* \)\( *[^|]\)/|\1|\2/
t a' data
1 | 1000 | 0 | 100 | 10 | 200 | 20 | 300 | 30 | 400
| | 40 | 500 | 50 | 600 | 60 | 700 | 70 | 800
| 1010 | 0 | 101 | 10 | 201 | 20 | 301 | 30 | 401
| | 40 | 501 | 50 | 601
2 | 1000 | 0 | 110 | 15 | 210 | 25 | 310 | 35 | 410
| | 45 | 510 | 55 | 610 | 65 | 710
| 1010 | 0 | 150 | 10 | 250 | 20 | 350 | 30 | 450
| | 40 | 550
$
The first line removes any trailing white space, to avoid confusion. The next two expressions handle the fixed-width columns 1 and 2 (6 characters each). The next line creates a label a
; the substitute finds a pipe |
, some spaces, some digits, a space, and some trailing material which doesn't include a pipe; and inserts a pipe in the middle. The t a
jumps back to the label if a substitution was done.
With that in place, it becomes easy to manage awk
with a field separator of |
.
This is verbose, but seems to do the trick:
awk -F '|' '
$1 > 0 { printf "%5d %4d %3d %3d\n", $1, $2, $3, $4
for (i = 5; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}
$2 > 0 { printf "%5s %4d %3d %3d\n", "", $2, $3, $4
for (i = 5; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}
{ for (i = 3; i <= NF; i += 2) { printf "%5s %4s %3d %3d\n", "", "", $i, $(i+1) }
next
}'
Output:
1 1000 0 100
10 200
20 300
30 400
40 500
50 600
60 700
70 800
1010 0 101
10 201
20 301
30 401
40 501
50 601
2 1000 0 110
15 210
25 310
35 410
45 510
55 610
65 710
1010 0 150
10 250
20 350
30 450
40 550
If you need to remove the headings, add 1d;
to the start of the sed
script.
Upvotes: 3