Reputation: 2602
I'm looking for the easiest way with awk to parse this HTML snippet:
<a id=1 data1="sth11" data2="sth12" data3="sth13 "><div class="cl1"></div></a> ;
<a id=2 data1="sth21" data2="sth22" data3=" sth23"><div class="cl2"></div></a> ;
<a id=2 data1="sth31" data3=" sth33 " data2="sth32" ><div class="cl3"></div></a> ;
Into this (the data3 attribute values, trimmed and concatenated, each followed by ;):
sth13;sth23;sth33;
I browsed the awk guides, but they are huge, and although this seems like a simple problem, I haven't found a good solution yet.
It would be great to have the solution along with an explanation and a source I can consult for similar problems, so that I don't have to ask every time.
I tried a simple one-liner, but it isn't good enough: the field number is fixed, the results are not joined with ; and the spaces are not trimmed:
cat data | awk -F'"' '/data3=/{print $6}'
Thank you
Upvotes: 0
Views: 345
Reputation: 23697
If the column number isn't fixed (just noticed that OP's input has data2/data3 switched for last line):
$ awk -v ORS=';' 'match($0, /data3="[^"]+"/){
m = substr($0, RSTART+7, RLENGTH-8);
gsub(/^ +| +$/, "", m); print m}' ip.txt
sth13;sth23;sth33;
-v ORS=';' will change the output record separator to ; instead of newline
match($0, /data3="[^"]+"/) will match a line containing data3=" followed by non-" characters and a " character
m = substr($0, RSTART+7, RLENGTH-8) will extract the matched portion, minus data3=" and the last " character
gsub(/^ +| +$/, "", m) will remove spaces from start/end of the string in m
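To see where the offsets 7 and 8 come from, here is a small sketch that prints the whole match next to the extracted part for the first sample line. The function name show_match is made up for this illustration; the awk calls are the same ones used in the command above.

```shell
# show_match is a hypothetical helper name used only for this illustration.
show_match() {
    awk 'match($0, /data3="[^"]+"/) {
        # Whole match: starts at RSTART and is RLENGTH characters long.
        print "[" substr($0, RSTART, RLENGTH) "]"
        # Skip the 7 characters of data3=" and shorten the length by
        # those 7 plus the closing quote (7 + 1 = 8).
        print "[" substr($0, RSTART+7, RLENGTH-8) "]"
    }'
}

printf '%s\n' '<a id=1 data1="sth11" data2="sth12" data3="sth13 "><div class="cl1"></div></a> ;' | show_match
```

This should print [data3="sth13 "] followed by [sth13 ]. The brackets make the trailing space visible; that space is what the gsub call removes afterwards.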
Modifying F. Knorr's solution:
awk -F'data3=" *' -v ORS=';' 'NF>1{sub(/ *".*/, "", $2); print $2}'
-F'data3=" *' will use data3=" followed by optional spaces as field separator
NF>1 will make sure only a line containing data3=" is selected
sub(/ *".*/, "", $2) will remove optional spaces, the closing " and all remaining characters from the second field
For multiple matches on the same line:
awk -F'data3=" *' -v ORS=';' '{for(i=2; i<=NF; i++){sub(/ *".*/, "", $i); print $i}}'
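As a quick check of the multiple-match version, the sketch below feeds it a made-up line with two data3 attributes. The wrapper name extract_all and the sample input are assumptions for illustration only; the awk program is the one shown above.

```shell
# extract_all is a hypothetical wrapper around the one-liner above.
extract_all() {
    awk -F'data3=" *' -v ORS=';' '{
        # Every field after the first starts right after a data3=" separator.
        for (i = 2; i <= NF; i++) {
            sub(/ *".*/, "", $i)   # cut from the closing quote to the end of the field
            print $i
        }
    }'
}

# Made-up input with two data3 attributes on one line.
printf '%s\n' '<a data3=" x1 "></a><a data3="x2"></a>' | extract_all
```

For this input the output should be x1;x2; because each occurrence of data3=" starts a new field, so the loop visits both values.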
Upvotes: 1
Reputation: 67567
$ grep -oP '(?<=data3=")[^"]*(?=")' file |
sed -E 's/^ +//;s/ +$//' |
paste -sd';'
sth13;sth23;sth33
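Extract the quoted string after data3= (the -P option needs GNU grep with PCRE support); trim the surrounding spaces; join the results with ; (note that paste does not add a trailing ;).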
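The trim and join stages can be tried on their own, without grep. This is only a sketch: the function name trim_and_join is hypothetical, and the three input strings stand in for what the grep stage would emit for the sample data.

```shell
# trim_and_join is a hypothetical name for the last two pipeline stages.
trim_and_join() {
    # sed strips leading and trailing spaces from each line;
    # paste -s joins all lines into one, using ; as the delimiter.
    sed -E 's/^ +//;s/ +$//' | paste -sd';' -
}

# Stand-ins for the untrimmed values the grep stage would produce.
printf '%s\n' 'sth13 ' ' sth23' ' sth33 ' | trim_and_join
```

This should print sth13;sth23;sth33 with no trailing ; because paste only puts the delimiter between lines.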
Upvotes: 0
Reputation: 3065
First, I would strongly advise against using awk for XML processing. There are better tools out there.
For the example you have provided this command would probably yield the desired output:
awk -F 'data3="|>' 'BEGIN{ORS=";"}{sub(/^ +/,"",$2); sub(/[ "].*/,"",$2); print $2}' file
Output:
sth13;sth23;sth33;
Demo: https://awk.js.org/?gist=192c1bf336fbf175ab1c143d5f92e50f
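To see why the two sub calls are needed, it can help to look at the raw second field before any cleanup. The sketch below does that for the third sample line; the function name show_field2 is made up for illustration, while the field separator is the one from the command above.

```shell
# show_field2 is a hypothetical helper that prints the raw second field.
show_field2() {
    # Fields are split on data3=" or on > so $2 begins right after data3="
    # and runs up to the next > character.
    awk -F 'data3="|>' '{print "[" $2 "]"}'
}

printf '%s\n' '<a id=2 data1="sth31" data3=" sth33 " data2="sth32" ><div class="cl3"></div></a> ;' | show_field2
```

This should print [ sth33 " data2="sth32" ]. The first sub removes the leading space, and the second cuts everything from the first space or double quote onward, leaving sth33.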
Upvotes: 1
Reputation: 12917
As others have suggested, a dedicated HTML/XML parser would be the best solution for this, but if you cannot use one, you can try the following GNU awk solution:
awk -F '[ >]' '{ gsub("data3=\"[[:space:]]+","data3=\"",$0);gsub("[[:space:]]+\"","\"",$0);for (i=1;i<=NF;i++) { if ($i ~ /data3/) { split($i,map,"=");gsub("\"","",map[2]);printf "%s;",map[2] } } }' file
Explanation:
awk -F '[ >]' '{ # Set the field delimiter to a space or ">"
  gsub("data3=\"[[:space:]]+","data3=\"",$0); # Remove any whitespace right after the opening quote of data3
  gsub("[[:space:]]+\"","\"",$0); # Remove any whitespace right before a double quote
  for (i=1;i<=NF;i++) { # Loop through each field
    if ($i ~ /data3/) { # Process the field only if it contains data3
      split($i,map,"="); # Split the field into the array map using "=" as the delimiter
      gsub("\"","",map[2]); # Remove the quotes from the second element of map
      printf "%s;",map[2] # Print the result followed by ;
    }
  }
}' file
Upvotes: 0