Reputation: 235

Substitute last 4 digits in second and third column

I have a file like the following:

2300
10     1112221234     111222123420231121PPPPD10+0000000850      ESIM
10     3334446789     333444678920231121PPPPD11+0000000950      RSIM
23

I want the output to be as follows:

2300
10     1112222345     111222234520231121PPPPD10+0000000850      ESIM
10     3334447890     333444789020231121PPPPD11+0000000950      RSIM
23

I tried the code shown below. It replaces the last 4 digits in the second column and the 4 digits before the date in the third column, but it also collapses the extra spaces and drops everything from the 11th character onwards in the third column. This is the output I got:

2300
10 1112222345 1112222345 ESIM
10 3334447890 3334447890 RSIM
23

awk '
BEGIN { FS=OFS=" " }
{if(length($2)>9 && length($3)>9)
   {$2 = substr($2,-10)
   $3 = substr($3,1,10) 
    for (i=2;i<=3;i++) {                                   
        str = substr($i, 1, length($i) - 4)                 
        for (j = length($i) - 3; j <= length($i); j++) {    
            str = str (substr($i, j, 1) + 1) % 10           
        }
        $i = str                                            
    }
}}
1' filename

Upvotes: 11

Answers (4)

Renaud Pacalet

Reputation: 29345

It is not clear if the characters to replace in the third field are always characters 7 to 10 (CASE 1) or if the third field always starts with digits, and the date part is always the last 8 digits before the first non-digit character, as in your example (CASE 2). Let's deal with both.

Your problem comes from the fact that you update the fields, which forces awk to recompute the record using the output field separator (OFS), that is, a single space, instead of the original separators. Moreover, you overwrite $2 and $3 with the substr(...) results to keep only 10 characters, discarding the others, which is not what you want.
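
The record-rebuilding effect is easy to see in isolation. Here is a minimal sketch on a made-up input line (the function name rebuild_demo is only for illustration): it prints the record untouched, then prints it again after a no-op assignment to $1.

```shell
# rebuild_demo is a hypothetical helper name used only for this illustration
rebuild_demo() {
    awk '{
        print        # untouched record: the original separators are kept
        $1 = $1      # assigning to any field makes awk rebuild $0 ...
        print        # ... by joining the fields with OFS (one space by default)
    }'
}

printf 'a    b\tc\n' | rebuild_demo
```

The first output line keeps the four spaces and the TAB; the second one is "a b c".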

To not discard parts of the second and third fields, well... don't discard them. To preserve the original field separators there are several options but the easiest to understand and design is probably to update the complete record ($0), instead of individual fields. Example for CASE 1:

awk 'length($2)>9 && length($3)>9 {
  match($0,/^([[:space:]]*[^[:space:]]+){2}/); a[1]=RLENGTH-3
  match($0,/^([[:space:]]*[^[:space:]]+){2}[[:space:]]+/); a[2]=RLENGTH+7
  for(i=1; i<=2; i++) for(j=a[i]; j<a[i]+4; j++)
    $0=substr($0,1,j-1) (substr($0,j,1)+1)%10 substr($0,j+1)
} 1' filename
2300
10     1112222345     111222234520231121PPPPD10+0000000850      ESIM
10     3334447890     333444789020231121PPPPD11+0000000950      RSIM
23

Explanations: we use match to find the index of the last character of the second field and of the last whitespace before the third field. We then adjust these to point to the first character to replace in the two substitutions (a[1] and a[2]).
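
To make the index arithmetic concrete, here is a small sketch that only prints the two start positions for the first data line of the example. It assumes 5 spaces between the first three fields, and it spells the regular expressions out without the {2} repetition so that it also runs on awk versions lacking interval expressions; the name field_offsets is hypothetical.

```shell
# field_offsets is a hypothetical helper name used only for this illustration
field_offsets() {
    awk '{
        # longest prefix that ends at the last character of field 2
        match($0, /^[ \t]*[^ \t]+[ \t]+[^ \t]+/)
        end2 = RLENGTH
        # same prefix plus the whitespace run in front of field 3
        match($0, /^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/)
        gap3 = RLENGTH
        # first character to replace in field 2 and in field 3
        print end2 - 3, gap3 + 7
    }'
}

# %5s and %6s with an empty argument print 5 and 6 spaces
printf '10%5s1112221234%5s111222123420231121PPPPD10+0000000850%6sESIM\n' '' '' '' |
    field_offsets
```

With this input the second field occupies characters 8 to 17 and the third field starts at character 23, so the sketch prints 14 29.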

Example for CASE 2:

awk 'length($2)>9 && $3~/^[0-9]{12}/ {
  match($0,/^([[:space:]]*[^[:space:]]+){2}/); a[1]=RLENGTH-3
  match($0,/^([[:space:]]*[^[:space:]]+){2}[[:space:]]+[0-9]+/); a[2]=RLENGTH-11
  for(i=1; i<=2; i++) for(j=a[i]; j<a[i]+4; j++)
    $0=substr($0,1,j-1) (substr($0,j,1)+1)%10 substr($0,j+1)
} 1' filename

Explanations: we modify the pattern part to select only records whose third field has at least 12 leading digits (the 4 digits to replace plus an 8-digit date). The second field is handled as before; for the third field we find the last leading digit and subtract 11 to point to the first digit to replace.

If your awk is GNU awk we can replace [[:space:]] with \s and [^[:space:]] with \S. But we can do even better, with only one match, thanks to the optional third argument of the GNU awk version of match: an array in which GNU awk stores the capture groups.

Example for CASE 1:

awk 'length($2)>9 && length($3)>9 {
  match($0,/^(\s*\S+\s+\S+)(\S{3}\s+\S{7})/,b)
  a[1]=length(b[1]); a[2]=a[1]+length(b[2])
  for(i=1; i<=2; i++) for(j=a[i]; j<a[i]+4; j++)
    $0=substr($0,1,j-1) (substr($0,j,1)+1)%10 substr($0,j+1)
} 1' filename

Still for CASE 1, FPAT is another interesting GNU awk feature: it lets us redefine the fields so that each one also contains the separator that follows it. Combined with an empty OFS, so that rebuilding the record does not insert extra spaces between the fields, this probably leads to the simplest of all solutions:

awk -v FPAT='[^[:space:]]+[[:space:]]*' -v OFS= '$2~/^\S{10}/ && $3~/^\S{10}/ {
  for(i=2; i<=3; i++) for(j=7; j<=10; j++)
    $i=substr($i,1,j-1) (substr($i,j,1)+1)%10 substr($i,j+1)
} 1' filename

Example for CASE 2:

awk 'length($2)>9 && $3~/^[0-9]{12}/ {
  match($0,/^(\s*\S+\s+\S+)(\S{3}\s+[0-9]+)[0-9]{11}/,b)
  a[1]=length(b[1]); a[2]=a[1]+length(b[2])
  for(i=1; i<=2; i++) for(j=a[i]; j<a[i]+4; j++)
    $0=substr($0,1,j-1) (substr($0,j,1)+1)%10 substr($0,j+1)
} 1' filename

Note: in all examples except the one using FPAT, the regular expressions passed to match also accept lines with leading whitespace (spaces, TABs and newlines). Remove the leading [[:space:]]* or \s* if you want to skip such lines.

Note: FS=OFS=" " is already the default, so the BEGIN block in your own code is redundant.

Upvotes: 0

jared_mamrot

Reputation: 26695

If you capture each 'part of interest' from columns $2 and $3, then increment the 4 digits, then use printf to print the lines, you can get your desired outcome, e.g.

awk 'BEGIN {
    FS = OFS = " "
}

{
    if (length($2) > 9 && length($3) > 9) {
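        # awk strings are 1-indexed: start at 1 to get the first 6 characters
        # (with a start of 0 some awk versions return only 5 characters)
        col2_first_part = substr($2, 1, 6)
        col2_4_digits = substr($2, 7, 4)
        col3_first_part = substr($3, 1, 6)
        col3_4_digits = substr($3, 7, 4)
        col3_last_part = substr($3, 11, length($3) - 10)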
        printf "%s\t%s", $1, col2_first_part
        for (i = 1; i <= 4; i++) {
            printf "%s", (substr(col2_4_digits, i, 1) + 1) % 10
        }
        printf "\t%s", col3_first_part
        for (j = 1; j <= 4; j++) {
            printf "%s", (substr(col3_4_digits, j, 1) + 1) % 10
        }
        printf "%s\t", col3_last_part
        for (k = 4; k <= NF; k++) {
            printf "%s%s", $k, (k < NF ? "\t" : "\n")
        }
    } else {
        print
    }
}' filename
2300
10  1112222345  111222234520231121PPPPD10+0000000850    ESIM
10  3334447890  333444789020231121PPPPD11+0000000950    RSIM
23

Upvotes: 4

markp-fuso

Reputation: 35306

Assumptions:

the string of interest (old) is the entire 2nd column
old is also the prefix of the 3rd column
old only shows up twice in a line (as 2nd column, as prefix of 3rd column)
lines of interest have 4 space-delimited columns
need to maintain spacing as it exists in the input

One awk idea:

awk '
NF==4 { old  = $2
        len  = length(old)
        new  = substr(old,1,len-4)
        for (i=len-3; i<=len; i++)
            new = new ((substr(old,i,1)+1) % 10)
        gsub(old,new)                              # replace both instances of "old" with "new"
      }
1
' filename

This generates:

2300
10     1112222345     111222234520231121PPPPD10+0000000850      ESIM
10     3334447890     333444789020231121PPPPD11+0000000950      RSIM
23
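
The third assumption matters because gsub replaces every occurrence of old in the record. If that assumption cannot be guaranteed, one possible variation is to call sub twice, which touches only the first two occurrences. This is a minimal sketch on a made-up line where the number also appears in the 4th column; it assumes old does not occur in the 1st column, and the name bump_first_two is hypothetical.

```shell
# bump_first_two is a hypothetical helper name used only for this sketch
bump_first_two() {
    awk 'NF == 4 {
        old = $2
        len = length(old)
        new = substr(old, 1, len - 4)
        for (i = len - 3; i <= len; i++)
            new = new ((substr(old, i, 1) + 1) % 10)
        sub(old, new)    # leftmost occurrence: the 2nd column
        sub(old, new)    # next remaining occurrence: the prefix of the 3rd column
    } 1'
}

printf '10  1112221234  111222123420231121  1112221234\n' | bump_first_two
```

This prints the line with the 2nd column and the prefix of the 3rd column changed to 1112222345, while the 4th column keeps 1112221234. As with gsub, $0 is edited directly, so the original spacing is preserved.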

Upvotes: 4

RavinderSingh13

Reputation: 133750

With GNU awk, please try the following code. Written and tested with the shown samples.

awk -v OFS="\t" '
# increment a single digit, wrapping 9 around to 0
function inc(d) { return (d + 1) % 10 }

# 2nd column: the last 4 digits
match($2,/(.*)([0-9])([0-9])([0-9])([0-9])$/,arr){
  $2 = arr[1] inc(arr[2]) inc(arr[3]) inc(arr[4]) inc(arr[5])
}
# 3rd column: characters 7 to 10, the rest is kept as is
match($3,/(^.{6})([0-9])([0-9])([0-9])([0-9])(.*$)/,arr){
  $3 = arr[1] inc(arr[2]) inc(arr[3]) inc(arr[4]) inc(arr[5]) arr[6]
}
1
' Input_file | column -t

Upvotes: 9
