Parsing the parsed column inside awk

Question

I am trying to use awk to parse a text file that looks like this:

001  data   John    Smith   address "London" | occupation "Driver" | exercise_level "Medium"
002  data   Rob Edward  address "Cardiff" | occupation "Physiotherapist" | exercise_level "High"
003  data   Dara    Pronk   address "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"
004  data   Marina  Francesca   address "Lugano" | country "Switzerland" | occupation "Chef" | exercise_level "High"

The first four columns are tab-separated, and the fifth column holds metadata as key "value" pairs separated by pipes.

I want the value of the occupation key as my fifth column. My desired output looks like this:

001  data   John    Smith   Driver
002  data   Rob Edward  Physiotherapist
003  data   Dara    Pronk   Teacher
004  data   Marina  Francesca   Chef

I can get the occupation field with this command:

awk -F'[	|]' '{for(i=5;i<=NF;i++){if($i~/^ occupation/){c=$i}} print $1, $2, $3, $4, c}' my_file

However, the result contains both the key and the value (e.g. occupation "Physiotherapist" instead of just Physiotherapist). Is there a way to parse the already parsed column, i.e. extract the value inside the quotes, with something like the following?

awk -F'[	|]' '{for(i=5;i<=NF;i++){if($i~/^ occupation/){c=$i}} ((parse c here, take $2 of " delimiter)) print $1, $2, $3, $4, c}' my_file

Akshay Hegde · Accepted Answer

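Using GNU awk (the three-argument form of match(), which stores the capture groups in an array, is a gawk extension)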

$ awk '{match($0,/occupation "([^"]*)"/,arr);print $1,$2,$3,$4,arr[1]}' infile
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef

Other awk (uses only match(), substr() and gsub(), so it does not depend on gawk)

$ awk '{
         match($0,/occupation "([^"]*)"/)    # sets RSTART and RLENGTH
         s=substr($0,RSTART,RLENGTH)         # e.g. occupation "Driver"
         gsub(/.* "|"/,"",s)                 # strip the key and both quotes
         print $1,$2,$3,$4,s
}' infile
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef

Input:

$ cat infile
001  data   John    Smith   address "London" | occupation "Driver" | exercise_level "Medium"
002  data   Rob Edward  address "Cardiff" | occupation "Physiotherapist" | exercise_level "High"
003  data   Dara    Pronk   address "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"
004  data   Marina  Francesca   address "Lugano" | country "Switzerland" | occupation "Chef" | exercise_level "High"
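
If you would rather keep the field-splitting approach from the question, the "take $2 of the quote delimiter" step can be done with split(). This is only a minimal sketch: extract_occupation is a hypothetical helper name, and it assumes the first four columns really are tab-separated, as the question states.

```shell
# extract_occupation is a hypothetical helper name, not part of the original answer.
# Assumes the first four columns are tab-separated, as stated in the question.
extract_occupation() {
    awk -F'[\t|]' '{
        c = ""
        for (i = 5; i <= NF; i++)
            if ($i ~ /^ *occupation /) {   # tolerate the space left after each pipe
                split($i, q, "\"")         # q[2] is the text between the first pair of quotes
                c = q[2]
            }
        print $1, $2, $3, $4, c
    }' "$@"
}

# Sample record taken from the question, with literal tabs between the first columns
printf '002\tdata\tRob\tEdward\taddress "Cardiff" | occupation "Physiotherapist" | exercise_level "High"\n' |
    extract_occupation
```

For that record it prints 002 data Rob Edward Physiotherapist. Resetting c for every record keeps a value from an earlier line from leaking into a line that has no occupation key.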

--edit to address comment--

Just wondering, in the second option (other awk), is it possible to store other variables (e.g. occupation for var s and exercise_level for var e)?

Set the search variable to a comma-separated list of the keys you need; the values are printed in the same order as the keys are listed.

awk -v search="occupation,exercise_level,address" '
BEGIN{
    split(search, arr, /,/)
}
{
    str = ""
    for(i=1; i in arr; i++)
    {
        regexp = arr[i]" \"([^\"]*)\""
        if(match($0,regexp)){
            s=substr($0,RSTART,RLENGTH)
            gsub(/.* "|"/,"",s)
            str = (str ? str OFS : "") s
        }
    }
    print $1,$2,$3,$4,str
}' infile
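
One thing to keep in mind with the command above: a key that is missing from a line is simply skipped, so the remaining values shift left (in the sample input, lines 001 and 002 have no country). If the columns need to stay aligned, a variant along these lines may help. It is a sketch only: extract_keys is a hypothetical helper name and the "-" placeholder is an arbitrary choice.

```shell
# extract_keys is a hypothetical helper; "-" as the placeholder is an assumption.
extract_keys() {
    search=$1; shift
    awk -v search="$search" '
    BEGIN { n = split(search, keys, ",") }
    {
        str = ""
        for (i = 1; i <= n; i++) {
            val = "-"                                  # placeholder when the key is absent
            if (match($0, keys[i] " \"[^\"]*\"")) {
                val = substr($0, RSTART, RLENGTH)      # e.g. occupation "Driver"
                gsub(/.* "|"/, "", val)                # strip the key and both quotes
            }
            str = str (i > 1 ? OFS : "") val
        }
        print $1, $2, $3, $4, str
    }' "$@"
}

# Two records from the question: the first has no country key, the second does
{
    printf '001\tdata\tJohn\tSmith\taddress "London" | occupation "Driver" | exercise_level "Medium"\n'
    printf '003\tdata\tDara\tPronk\taddress "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"\n'
} | extract_keys "country,occupation"
```

This prints 001 data John Smith - Driver and 003 data Dara Pronk Holland Teacher, so every output line has the same number of columns regardless of which keys are present.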
