Reputation: 447
I am trying to find all the places where my data has a repeating line and delete the repeating line. I am also looking for the places where the 2nd column has the value 90, so that I can replace the 2nd column of the following line with a specific number I designate.
My data looks like this:
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
7 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
I want my data to look like this:
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 5 0 0 0.0000 70221
My code:
BEGIN {
priorline = "";
ERROROFFSET = 50;
ERRORVALUE[10] = 1;
ERRORVALUE[11] = 2;
ERRORVALUE[12] = 3;
ERRORVALUE[30] = 4;
ERRORVALUE[31] = 5;
ERRORVALUE[32] = 6;
ORS = "\n";
}
NR == 1 {
print;
getline;
priorline = $0;
}
NF == 6 {
brandnewline = $0
mytype = $2
$0 = priorline
priorField2 = $2;
if (mytype !~ priorField2) {
print;
priorline = brandnewline;
}
if (priorField2 == "90") {
mytype = ERRORVALUE[mytype];
}
}
END {print brandnewline}
##Here brandnewline is set to the current line, and priorline holds the line we
##just worked on; after a line is processed, priorline takes the value of
##brandnewline and brandnewline moves on to the next line (i.e. line 1 =
##brandnewline; we then set priorline = brandnewline, so priorline is line 1
##and brandnewline takes on line 2). The same scheme applies to column 2:
##mytype is the current column-2 value and priorField2 is the previous one.
##The first if statement prints the current line only when its column-2 value
##does not match (!~) the column-2 value of the previous line; otherwise the
##line is skipped. The second if statement recognizes the lines on which the
##value 90 appears and replaces the column-2 value with a previously defined
##ERRORVALUE set for each specific type (type 10=1, 11=2, 12=3, 30=4, 31=5,
##32=6).
I have been able to successfully delete the repeating lines; however, I am unable to get the next part of my code to work, which is to replace the column-2 value on the line following a 90 with the ERRORVALUE I designated in BEGIN (10=1, 11=2, 12=3, 30=4, 31=5, 32=6). Essentially, I just want to replace that value in the line with my ERRORVALUE.
If anyone can help me with this I would be very grateful.
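A minimal sketch of just the replacement step the question describes, assuming awk, a single ERRORVALUE entry, and two rows taken from the sample data: a flag set on the 90 line triggers the substitution on the next line.

```shell
# Flag the line whose 2nd column is 90, then replace the 2nd column
# of the following line from ERRORVALUE (only the 31 -> 5 entry shown).
printf '11 90 0 0 0.0000 68700\n12 31 0 0 0.0000 70221\n' |
awk 'BEGIN { ERRORVALUE[31] = 5 }
     swap     { $2 = ERRORVALUE[$2]; swap = 0 }
     $2 == 90 { swap = 1 }
              { print }'
```

Note that assigning to $2 rebuilds $0 with single spaces between fields, which is why the answers below re-print with printf to keep the columns aligned.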
Upvotes: 4
Views: 169
Reputation: 1045
The previous options work for the most part; however, here is the way I would do it, simple and sweet. After reviewing the other posts, I believe this is the most efficient. It also handles the extra request the OP added in the comments: replacing the line after the 90 with a value taken from two lines prior. It does all of this in a single pass.
BEGIN {
PC2=PC6=1337
replacement=5
}
{
if( $6 == PC6 ) next
if( PC2 == 90 ) $2 = replacement
replacement = PC2
PC2 = $2
PC6 = $6
printf "%4s%8s%3s%5s%9s%6s\n",$1, $2, $3, $4, $5, $6
}
Example Input
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
7 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 31 0 0 0.0000 70221
Example Output
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 21 0 0 0.0000 70221
Upvotes: 0
Reputation: 54592
I agree with Glenn that two passes over the file is nicer. You can remove your duplicate, perhaps nonconsecutive, lines using a hash like this:
awk '!a[$2,$3,$4,$5,$6]++' file.txt
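As a quick illustration of the !a[...]++ idiom with some made-up three-column rows: the expression is true, and the line prints, only the first time a given key is seen.

```shell
# Duplicate detection keyed on columns 2 and 3 (column 1 is ignored,
# so rows 6 and 7 count as duplicates despite differing IDs):
printf '6 41 x\n7 41 x\n8 70 y\n' | awk '!a[$2,$3]++'
# -> 6 41 x
# -> 8 70 y
```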
You should then edit your values as desired. If you wish to change the value 90 in the second column to 5000, try something like this:
awk 'NR == 1 { print; next } { sub(/^90$/, "5000", $2); printf("%4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6) }' file.txt
You can see that I stole Zsolt's printf statement (thanks Zsolt!) for the formatting, but you can edit this if necessary. You can also pipe the output from the first statement into the second for a nice one-liner:
awk '!a[$2,$3,$4,$5,$6]++' file.txt | awk 'NR == 1 { print; next } { sub(/^90$/, "5000", $2); printf("%4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6) }'
Upvotes: 0
Reputation: 51693
This might work for you:
awk 'BEGIN {
ERROROFFSET = 50;
ERRORVALUE[10] = 1;
ERRORVALUE[11] = 2;
ERRORVALUE[12] = 3;
ERRORVALUE[30] = 4;
ERRORVALUE[31] = 5;
ERRORVALUE[32] = 6;
}
NR == 1 { print ; next }
{ if (a[$2 $6]) { next } else { a[$2 $6]++ }
if ( $2 == 90) { print ; n++ ; next }
if (n>0) { $2 = ERRORVALUE[$2] ; n=0 }
printf("% 4i% 8i% 3i% 5i% 9.4f% 6i\n", $1, $2, $3, $4, $5, $6)
}' INPUTFILE
See it in action here at ideone.com.
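One subtlety worth demonstrating before the walkthrough: the a[$2 $6] index concatenates the two columns with no separator, so different column pairs can collide on the same key. A tiny self-contained check (the array names a and b are arbitrary):

```shell
awk 'BEGIN {
  a["2" "0020"]        # no separator: the key is "20020"
  b["2" "-" "0020"]    # with a separator: the key is "2-0020"
  print ((("20" "020") in a) ? "collision" : "distinct")
  print ((("20" "-" "020") in b) ? "collision" : "distinct")
}'
# -> collision
# -> distinct
```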
IMO the BEGIN block is obvious. Then the following happens:
- The NR == 1 rule prints the very first line and switches to the next line (this rule applies only to the very first line).
- The next rule skips the current line if columns 2 and 6 have been seen together before, otherwise it records them. (Beware: the index concatenates the columns directly, so 2 0020 concatenated is 20020 and it's the same for 20 020; you might want to add a column separator in the index, like a[$2 "-" $6] ... and you can use more columns to check even more properly.)
- The rule matching 90 on the second column prints the line, flags to swap on the next line, then switches to the next line (in the input file).
- The last rule looks up the column-2 value in ERRORVALUE and, if it finds it, replaces its contents.
Upvotes: 1
Reputation: 58578
This might work for you:
v=99999
sed ':a;$!N;s/^\(\s*\S*\s*\)\(.*\)\s*\n.*\2/\1\2/;ta;s/^\(\s*\S*\s*\) 90 /\1'"$(printf "%5d" $v)"' /;P;D' file
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 99999 0 0 0.0000 68700
12 31 0 0 0.0000 70221
Upvotes: 1
Reputation: 247260
One challenge is that you can't just compare one line with the previous because the ID number will be different.
awk '
BEGIN {
ERRORVALUE[10] = 1
# ... etc
}
# print the header
NR == 1 {print; next}
NR == 2 || $0 !~ prev_regex {
prev_regex = sprintf("^\\s+\\w+\\s+%s\\s+%s\\s+%s\\s+%s\\s+%s",$2,$3,$4,$5,$6)
if (was90) $2 = ERRORVALUE[$2]
print
was90 = ($2 == 90)
}
'
For lines where the 2nd column is altered, this ruins the line formatting:
# Type Response Acc RT Offset
1 70 0 0 0.0000 57850
2 31 0 0 0.0000 59371
3 41 0 0 0.0000 60909
4 70 0 0 0.0000 61478
5 31 0 0 0.0000 62999
6 41 0 0 0.0000 64537
8 70 0 0 0.0000 65106
9 11 0 0 0.0000 66627
10 21 0 0 0.0000 68165
11 90 0 0 0.0000 68700
12 5 0 0 0.0000 70221
If that's a problem, you could pipe the output of gawk into column -t, or if you know the line format is fixed, use printf() in the awk program.
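A sketch of the printf route; the widths below are guessed from the sample layout, not taken from this answer:

```shell
# Re-align one edited row with fixed-width conversions:
printf '12 5 0 0 0.0000 70221\n' |
awk '{ printf "%4i%8i%3i%5i%9.4f%6i\n", $1, $2, $3, $4, $5, $6 }'
```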
Upvotes: 2