kaka01

Reputation: 65

Parsing the parsed column inside awk

I am trying to use awk for parsing a text file that looks like this:

001  data   John    Smith   address "London" | occupation "Driver" | exercise_level "Medium"
002  data   Rob Edward  address "Cardiff" | occupation "Physiotherapist" | exercise_level "High"
003  data   Dara    Pronk   address "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"
004  data   Marina  Francesca   address "Lugano" | country "Switzerland" | occupation "Chef" | exercise_level "High"

The first four columns are tab-separated, and the fifth column holds metadata as key "value" pairs separated by pipes.

I want the value of the occupation key as my fifth column. The desired output looks like this:

001  data   John    Smith   Driver
002  data   Rob Edward  Physiotherapist
003  data   Dara    Pronk   Teacher
004  data   Marina  Francesca   Chef

I am able to get the occupation by this command:

awk -F'[\t|]' '{for(i=5;i<=NF;i++){if($i~/^ occupation/){c=$i}} print $1, $2, $3, $4, c}' my_file

However, this returns the key and the value together (e.g. occupation "Physiotherapist" instead of just Physiotherapist). Is there a way to parse the already-parsed column (i.e. extract the value inside the quotes), something like the pseudocode below?

awk -F'[\t|]' '{for(i=5;i<=NF;i++){if($i~/^ occupation/){c=$i}} ((parse c here, take $2 of " delimiter)) print $1, $2, $3, $4, c}' my_file

Upvotes: 1

Views: 107

Answers (2)

Akshay Hegde

Reputation: 16997

Using GNU awk (the three-argument form of match() is a gawk extension):

$ awk '{match($0,/occupation "([^"]*)"/,arr);print $1,$2,$3,$4,arr[1]}' infile
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef

Other awk (two-argument match(), which sets RSTART and RLENGTH):

$ awk '{
         match($0,/occupation "([^"]*)"/); 
         s=substr($0,RSTART,RLENGTH); 
         gsub(/.* "|"/,"",s); 
         print $1,$2,$3,$4,s
}' infile
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef

Input:

$ cat infile
001  data   John    Smith   address "London" | occupation "Driver" | exercise_level "Medium"
002  data   Rob Edward  address "Cardiff" | occupation "Physiotherapist" | exercise_level "High"
003  data   Dara    Pronk   address "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"
004  data   Marina  Francesca   address "Lugano" | country "Switzerland" | occupation "Chef" | exercise_level "High"
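For comparison, the loop from the question can also be completed more or less as it was sketched there, by splitting the matched field on the double-quote character and taking the second piece. This is only a minimal sketch: the function name occupation_column is made up for illustration, the input is assumed to be tab-separated as described in the question, and the values are assumed not to contain a pipe or an embedded double quote.

```shell
# Hypothetical helper: reads the records on stdin and prints the first four
# columns followed by the unquoted occupation value.
occupation_column() {
  awk -F'[\t|]' '{
    c = ""                          # reset so a missing key prints as empty
    for (i = 5; i <= NF; i++)
      if ($i ~ /^ *occupation /) {
        split($i, q, "\"")          # q[1] = key part, q[2] = text inside the quotes
        c = q[2]
      }
    print $1, $2, $3, $4, c
  }'
}

printf '002\tdata\tRob\tEdward\taddress "Cardiff" | occupation "Physiotherapist" | exercise_level "High"\n' |
  occupation_column
```

Because the split happens on the quote itself rather than on whitespace, a value containing spaces should come through intact.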

--edit to address comment--

Just wondering, in the second option (other awk), is it possible to store other variables (e.g. occupation for var s and exercise_level for var e)?

Set the search variable to a comma-separated list of the keys you need. The values are printed in the same order as the keys are listed:

awk -v search="occupation,exercise_level,address" '
BEGIN {
    split(search, arr, /,/)                  # keys to extract, in output order
}
{
    str = ""
    for (i = 1; i in arr; i++) {
        regexp = arr[i] " \"([^\"]*)\""      # matches: key "value"
        if (match($0, regexp)) {
            s = substr($0, RSTART, RLENGTH)  # the whole key "value" pair
            gsub(/.* "|"/, "", s)            # drop the key and both quotes
            str = (str ? str OFS : "") s
        }
    }
    print $1, $2, $3, $4, str
}' infile

Upvotes: 2

ghoti

Reputation: 46826

Using any old awk (GNU works too but is not required):

$ awk -F'\t' '{split($5,a,/ *\| */); for (i in a) { split(a[i],b," "); d[b[1]]=b[2] } print $1 OFS $2 OFS $3 OFS $4 OFS d["occupation"]}' i
001 data John Smith "Driver"
002 data Rob Edward "Physiotherapist"
003 data Dara Pronk "Teacher"
004 data Marina Francesca "Chef"

Split out for easier reading (and commenting):

BEGIN {
  OFS=FS="\t"           # set the input and output field separators to tab
}

{
  split("", d)          # clear values left over from the previous line
  split($5,a,/ *\| */)  # split your embedded array by vertical bar
  for (i in a) {        # step through the array,
    split(a[i],b," ")   # splitting as you go
    #gsub(/"/,"",b[2])  # optionally remove quotes
    d[b[1]]=b[2]        # and assigning indices in a new data array
  }
  print $1 OFS $2 OFS $3 OFS $4 OFS d["occupation"]     # and print the result
}

While the extra step of split() and the for loop may look cumbersome, it has the advantage of making ALL your embedded data available by name in a handy array. (This addresses the request you made in comments on 3161993's answer.)

Note that at present, the split() breaks on whitespace, so if you want to be able to handle data containing spaces (i.e. inside the quotes), a little more work will be required. If you want the output to be presented without quotes, you can gsub() the data after assigning it, within the for loop (to remove all quotes) or use a pair of sub() commands to remove leading and trailing quotes.
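The "little more work" mentioned above might look something like the following. Treat it as a sketch under assumptions rather than a drop-in replacement: the function name parse_meta and the "Bus Driver" sample value are invented for illustration, and values are assumed not to contain a pipe. Instead of splitting each pair on whitespace, it cuts the key at the first space and strips only the surrounding quotes, so spaces inside the quotes survive.

```shell
# Hypothetical helper: reads tab-separated records on stdin and prints the
# first four columns plus two of the named metadata values.
parse_meta() {
  awk -F'\t' -v OFS='\t' '{
    split("", d)                      # portable way to empty the array per record
    n = split($5, a, / *\| */)        # one key "value" pair per element
    for (i = 1; i <= n; i++) {
      sub(/^ */, "", a[i])            # tolerate leading blanks before the key
      p = index(a[i], " ")            # the key ends at the first space
      if (p == 0) continue            # skip anything that is not a pair
      k = substr(a[i], 1, p - 1)
      v = substr(a[i], p + 1)
      gsub(/^ *"|" *$/, "", v)        # remove only the surrounding quotes
      d[k] = v
    }
    print $1, $2, $3, $4, d["occupation"], d["exercise_level"]
  }'
}

printf '001\tdata\tJohn\tSmith\taddress "London" | occupation "Bus Driver" | exercise_level "Medium"\n' |
  parse_meta
```

Any other key in the fifth column can be printed the same way by indexing d with its name; a key that is missing from a record simply prints as an empty field.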

Upvotes: 0
