Reputation: 65
I am trying to use awk for parsing a text file that looks like this:
001 data John Smith address "London" | occupation "Driver" | exercise_level "Medium"
002 data Rob Edward address "Cardiff" | occupation "Physiotherapist" | exercise_level "High"
003 data Dara Pronk address "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"
004 data Marina Francesca address "Lugano" | country "Switzerland" | occupation "Chef" | exercise_level "High"
The first 4 columns are separated by tabs, and the 5th column holds metadata items separated by pipes.
I want to get the "values" of the occupation "key" as my fifth column. My desired output will look like this:
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef
I am able to get the occupation by this command:
awk -F'[\t|]' '{for(i=5;i<=NF;i++){if($i~/^ occupation/){c=$i}} print $1, $2, $3, $4, c}' my_file
However, it will have both the key and value together (e.g. occupation "Physiotherapist" instead of just Physiotherapist). Is there a way to kind of parse the parsed column (i.e. parsing the value inside quotes), something like below?
awk -F'[\t|]' '{for(i=5;i<=NF;i++){if($i~/^ occupation/){c=$i}} ((parse c here, take $2 of " delimiter)) print $1, $2, $3, $4, c}' my_file
Upvotes: 1
Views: 107
Reputation: 16997
Using GNU awk
$ awk '{match($0,/occupation "([^"]*)"/,arr);print $1,$2,$3,$4,arr[1]}' infile
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef
Other awk
$ awk '{
    match($0,/occupation "([^"]*)"/);
    s=substr($0,RSTART,RLENGTH);
    gsub(/.* "|"/,"",s);
    print $1,$2,$3,$4,s
}' infile
001 data John Smith Driver
002 data Rob Edward Physiotherapist
003 data Dara Pronk Teacher
004 data Marina Francesca Chef
Input:
$ cat infile
001 data John Smith address "London" | occupation "Driver" | exercise_level "Medium"
002 data Rob Edward address "Cardiff" | occupation "Physiotherapist" | exercise_level "High"
003 data Dara Pronk address "Groningen" | country "Holland" | occupation "Teacher" | exercise_level "Low"
004 data Marina Francesca address "Lugano" | country "Switzerland" | occupation "Chef" | exercise_level "High"
--edit to address comment--
Just wondering, in the second option (other awk), is it possible to store other variables (e.g. occupation for var s and exercise_level for var e)?
Set the search variable (search="....") to a comma-separated list of the keys you need. The values are printed in the same order as the keys are listed.
awk -v search="occupation,exercise_level,address" '
BEGIN{
    split(search, arr, /,/)
}
{
    str = "";
    for(i=1; i in arr; i++)
    {
        regexp = arr[i]" \"([^\"]*)\"";
        if(match($0,regexp)){
            s=substr($0,RSTART,RLENGTH);
            gsub(/.* "|"/,"",s);
            str = (str ? str OFS : "") s
        }
    }
    print $1,$2,$3,$4,str
}' infile
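One thing to be aware of with the loop above: a key that is missing from a line contributes nothing to the output, so later values shift left. If fixed columns matter, a variation along these lines may help. It is only a sketch: the function name extract_keys and the "-" placeholder are assumptions made for illustration, not part of the original answer.

```shell
# Hypothetical helper: read records on stdin and print the first four
# fields followed by one column per requested key.
# $1 is a comma-separated key list, e.g. "occupation,country".
extract_keys() {
  awk -v search="$1" '
  BEGIN { n = split(search, keys, /,/) }
  {
    str = ""
    for (i = 1; i <= n; i++) {
      val = "-"                          # assumed placeholder for a missing key
      if (match($0, keys[i] " \"[^\"]*\"")) {
        val = substr($0, RSTART, RLENGTH)
        gsub(/.* "|"/, "", val)          # drop the key and both quotes
      }
      str = str (i > 1 ? OFS : "") val   # always emit one column per key
    }
    print $1, $2, $3, $4, str
  }'
}
```

It would be called as: extract_keys "occupation,country" < infile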
Upvotes: 2
Reputation: 46826
Using any old awk (GNU works too but is not required):
$ awk -F'\t' '{split($5,a,/ *\| */); for (i in a) { split(a[i],b," "); d[b[1]]=b[2] } print $1 OFS $2 OFS $3 OFS $4 OFS d["occupation"]}' i
001 data John Smith "Driver"
002 data Rob Edward "Physiotherapist"
003 data Dara Pronk "Teacher"
004 data Marina Francesca "Chef"
Split out for easier reading (and commenting):
BEGIN {
    OFS=FS="\t"                   # set the input and output field separators
}
{
    split("", d)                  # clear values left over from the previous record
    split($5,a,/ *\| */)          # split your embedded array by vertical bar
    for (i in a) {                # step through the array,
        split(a[i],b," ")         # splitting as you go
        #gsub(/"/,"",b[2])        # optionally remove quotes
        d[b[1]]=b[2]              # and assigning indices in a new data array
    }
    print $1 OFS $2 OFS $3 OFS $4 OFS d["occupation"]   # and print the result
}
While the extra step of split() and the for loop may look cumbersome, it has the advantage of making ALL your embedded data available by name in a handy array. (This addresses the request you made in comments on 3161993's answer.)
Note that at present, the inner split() breaks on whitespace, so if you want to be able to handle data containing spaces (i.e. inside the quotes), a little more work will be required. If you want the output to be presented without quotes, you can gsub() the data after assigning it, within the for loop (to remove all quotes), or use a pair of sub() commands to remove leading and trailing quotes.
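As a rough sketch of that "little more work": instead of splitting each item on whitespace, take everything up to the first space as the key and the rest as the value, then strip the surrounding quotes with a pair of sub() calls. The function name occupation_of is made up for this example, and the sketch assumes values never contain a vertical bar.

```shell
# Hypothetical helper: read tab-separated records on stdin and print the
# first four fields plus the unquoted occupation value.
occupation_of() {
  awk -F'\t' -v OFS='\t' '
  {
    split("", d)                     # clear values left over from the previous record
    n = split($5, a, / *\| */)       # one key "value" item per element
    for (i = 1; i <= n; i++) {
      item = a[i]
      sub(/^ +/, "", item)           # trim leading spaces
      p = index(item, " ")           # the key ends at the first space
      if (p == 0) continue           # skip items without a value
      key = substr(item, 1, p - 1)
      val = substr(item, p + 1)      # the value may itself contain spaces
      sub(/^ *"/, "", val)           # remove the leading quote
      sub(/" *$/, "", val)           # remove the trailing quote
      d[key] = val
    }
    print $1, $2, $3, $4, d["occupation"]
  }'
}
```

Run it as: occupation_of < infile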
Upvotes: 0