DSTO
DSTO

Reputation: 285

AWK: Create new columns based on existing columns

Below is how my file looks like (sorted on $3):

name_1|G1026|2017-08-27|2017-08-27|2017-09-02|19|19|21
name_2|G1566|2018-05-05|2018-05-05|2018-06-11|51|51|2B
name_2|G2124|2018-06-11|2018-06-11|2018-06-11|51|19|2B
name_2|G2125|2018-06-11|2018-06-11|2018-06-15|51|19|41
name_1|G4391|2020-08-14|2020-08-14|2020-08-20|19|19|21
name_1|G4392|2020-08-14|2020-08-20|2020-08-20|19|51|21
    

The fields separator is |. I am trying to add one extra column $9 to this file based on the existing columns. For multiple instances of names in $1, I want to apply the conditions below:

cond1 && (cond2 || cond3 || cond4) && (!cond5)

Let prev and cur be two rows with the same first field, a non-empty third field, and cur following prev. Rows 1,5 or 5,6 are pairs of such rows with first field name_1. Rows 2,3 and 3,4 are pairs of such rows with first field name_2.

Let delta = number-of-days(prev.$5 - cur.$4) be the number of days prev.$5 is past cur.$4.

The conditions are:

  1. cond1 = (0 <= delta <= 2 days)

    for example, for the 1st instance of name_1 (1st row), check if prev.$5 from 1st instance (1st row) is between 0 and 2 days later cur.$4 from 2nd instance (6th row).

  2. cond2 = (prev.$6 == 51)

  3. cond3 = (cur.$7 == 51)

  4. cond4 = (cur.$8 == "2B" || cur.$8 == 41)

  5. cond5 = (prev.$6 == 19 && cur.$7 == 51 && cur.$8 == 21)

If these conditions are met then add column $9 to the first of the two rows so the output would be like the one given below.

name_1|G1026|2017-08-27|2017-08-27|2017-09-02|19|19|21
name_2|G1566|2018-05-05|2018-05-05|2018-06-11|51|51|2B|group1
name_2|G2124|2018-06-11|2018-06-11|2018-06-11|51|19|2B|group2
name_2|G2125|2018-06-11|2018-06-11|2018-06-15|51|19|41
name_1|G4391|2020-08-14|2020-08-14|2020-08-20|19|19|21
name_1|G4392|2020-08-14|2020-08-20|2020-08-20|19|51|21

The added column starts with group1. The leading number increments each time a column is added.

If the required prev.$ and cur.$ values were in a single line then I could have have applied the below code:

awk -F "|" '{if ($1=="name_1" && (($5-$4)<=2) && ($6==51||$7==51||$8==2B|41) &&($6!=19 && $7!=51 && $8!=21)) print $9="group1"}' OFS="|" file

Any lead on how to solve this with awk would be highly appreciated!

Upvotes: 3

Views: 387

Answers (1)

Renaud Pacalet
Renaud Pacalet

Reputation: 29365

The following needs GNU awk extensions (mktime):

$ cat foo.awk
function d2ts(d) {
  gsub(/-/, " ", d)
  return mktime(d " 0 0 0")
}

BEGIN {
  f8["2B"] = 1;
  f8["41"] = 1;
}

FNR == NR {
  if($1 in ts && (f6[$1] == 51 || $7 == 51 || $8 in f8) &&
     !(f6[$1] == 19 && $7 == 51 && $8 == 21)) {
    delta = ts[$1] - d2ts($4)
    if(delta >= -12*3600 && delta <= 60*3600)
      change[nr[$1]] = 1
  }
  ts[$1] = d2ts($5)
  f6[$1] = $6
  nr[$1] = NR
  next
}

{
  if(FNR in change)
    $(NF+1) = "group" ++cnt
  print
}

$ awk -F'|' -f foo.awk OFS='|' file file
name_1|G1026|2017-08-27|2017-08-27|2017-09-02|19|19|21
name_2|G1566|2018-05-05|2018-05-05|2018-06-11|51|51|2B|group1
name_2|G2124|2018-06-11|2018-06-11|2018-06-11|51|19|2B|group2
name_2|G2125|2018-06-11|2018-06-11|2018-06-15|51|19|41
name_1|G4391|2020-08-14|2020-08-14|2020-08-20|19|19|21
name_1|G4392|2020-08-14|2020-08-20|2020-08-20|19|51|21

We proceed in 2 phases, reason why file is passed twice. The first pass checks all your conditions and stores the numbers of the record to modify in associative array change. The second phase adds the last column to all records which number is a key of the change associative array.

Note: in order to account for daylight savings time and leap seconds the date comparison does not use 0 and 2 days thresholds but minus half a day (-12*3600) and 2 days and a half (60*3600). As your date fields have only a 1 day resolution this should behave as expected.

Explanations:

  • The d2ts function converts the YYYY-MM-DD dates to UNIX timestamps, that is seconds since 1970/01/01. This is done by first using gsub to convert YYYY-MM-DD to YYYY MM DD (spaces instead of -), concatenating 0 0 0 for hours, minutes, seconds, and then converting to UNIX timestamp with mktime.

  • As your algorithm refers to past lines we use associative arrays (ts, f6 and nr) to store information about the last encountered lines with a given field #1 value. The keys are the field #1 values (name_X) and the values are, respectively, the UNIX timestamp of field #5, the value of field #6 and the record number.

  • As you have more than one candidate value for field #8 we define another associative array (f8) in the BEGIN section, and use the in operator for the test.

Of course, depending on your files, you could encounter memory issues during the first phase. If you have billions of different name_X values, for instance, some adjustments will probably be needed to avoid a memory overflow due to the size of the associative arrays.

Upvotes: 4

Related Questions