Reputation: 2602

Extract html tags' attributes

I'm looking for the easiest way with awk to parse this HTML snippet:

<a id=1 data1="sth11" data2="sth12" data3="sth13 "><div class="cl1"></div></a> ;
<a id=2 data1="sth21" data2="sth22" data3=" sth23"><div class="cl2"></div></a>   ;
<a id=2 data1="sth31" data3="  sth33  " data2="sth32" ><div class="cl3"></div></a>  ;

Into this (the data3 attribute values, trimmed and joined with ;):

sth13;sth23;sth33;

I tried browsing awk guides, but they are huge, and although this seems like a simple problem, I haven't found a good solution yet.

It would be great to have the solution along with an explanation and some references, so that I can handle similar problems myself without asking every time.

I've tried a simple one, but it's not good: the field number is fixed, it doesn't join the values with ;, and it doesn't trim the spaces:

cat data | awk -F'"' '/data3=/{print $6}'

Thank you

Upvotes: 0

Answers (4)

Sundeep

Reputation: 23697

If the column number isn't fixed (just noticed that OP's input has data2/data3 switched for last line):

$ awk -v ORS=';' 'match($0, /data3="[^"]+"/){
                  m = substr($0, RSTART+7, RLENGTH-8);
                  gsub(/^ +| +$/, "", m); print m}' ip.txt 
sth13;sth23;sth33;

-v ORS=';' will change output record separator to ; instead of newline
match($0, /data3="[^"]+"/) will match a line containing data3=" followed by non-" characters and a " character
m = substr($0, RSTART+7, RLENGTH-8) will extract the matched portion, minus data3=" and the last " character
gsub(/^ +| +$/, "", m) will remove spaces from start/end of the string in m
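
To make the steps above easy to try out, here is a minimal self-contained sketch that feeds the question's sample lines through the same match/substr/gsub approach. The function name extract_data3 is made up for illustration, and the sketch deviates slightly from the command above: it also trims tabs and skips empty values, which is an assumption about what is wanted rather than something the question states.

```shell
# Hypothetical helper (name is illustrative): reads HTML-ish lines on stdin and
# prints the trimmed value of the first data3="..." on each line, followed by ';'.
extract_data3() {
  awk -v ORS=';' 'match($0, /data3="[^"]*"/) {
    m = substr($0, RSTART + 7, RLENGTH - 8)   # drop the 7-char prefix data3=" and the closing "
    gsub(/^[ \t]+|[ \t]+$/, "", m)            # trim leading/trailing spaces and tabs
    if (m != "") print m                      # assumption: empty values are skipped
  }'
}

# Usage with the sample input from the question
extract_data3 <<'EOF'
<a id=1 data1="sth11" data2="sth12" data3="sth13 "><div class="cl1"></div></a> ;
<a id=2 data1="sth21" data2="sth22" data3=" sth23"><div class="cl2"></div></a>   ;
<a id=2 data1="sth31" data3="  sth33  " data2="sth32" ><div class="cl3"></div></a>  ;
EOF
# prints: sth13;sth23;sth33;
```

Lines without a data3 attribute produce no output, because the match() call is the pattern that selects which lines run the action.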

Modifying F. Knorr's solution:

awk -F'data3=" *' -v ORS=';' 'NF>1{sub(/ *".*/, "", $2); print $2}'

-F'data3=" *' will use data3=" followed by optional spaces as field separator
NF>1 will make sure only a line containing data3=" is selected
sub(/ *".*/, "", $2) will remove optional trailing spaces, the closing " and everything after it from $2

For multiple matches:

awk -F'data3=" *' -v ORS=';' '{for(i=2; i<=NF; i++){sub(/ *".*/, "", $i); print $i}}'
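
As an alternative to splitting on the attribute name, multiple matches per line can also be collected by calling match() repeatedly on the rest of the line. The following is only a sketch under that assumption; the function name extract_all_data3 and the one-line test input are invented for illustration.

```shell
# Hypothetical helper (name is illustrative): prints every data3 value on every
# input line, trimmed, each followed by ';'.
extract_all_data3() {
  awk -v ORS=';' '{
    rest = $0
    while (match(rest, /data3="[^"]*"/)) {
      m = substr(rest, RSTART + 7, RLENGTH - 8)   # value between the quotes
      gsub(/^[ \t]+|[ \t]+$/, "", m)              # trim spaces and tabs
      print m
      rest = substr(rest, RSTART + RLENGTH)       # continue after this match
    }
  }'
}

# Made-up input with two tags on one line
printf '%s\n' '<a data3=" x1 "></a><a data3="x2"></a>' | extract_all_data3
# prints: x1;x2;
```

Because the loop always resumes after the previous match, it terminates even when an attribute value is empty.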

Upvotes: 1

karakfa

Reputation: 67567

$ grep -oP '(?<=data3=")[^"]*(?=")' file | 
  sed -E 's/^ +//;s/ +$//'               | 
  paste -sd';'

sth13;sth23;sth33

extract quoted string next to data3=; trim extra whitespace; concatenate the results.

Upvotes: 0

F. Knorr

Reputation: 3065

First, I would strongly advise against using awk for XML processing. There are better tools out there.

For the example you have provided this command would probably yield the desired output:

awk -F 'data3="|>' 'BEGIN{ORS=";"}{sub(/^ +/,"",$2); sub(/[ "].*/,"",$2); print $2}' file

Output:

sth13;sth23;sth33;

Demo: https://awk.js.org/?gist=192c1bf336fbf175ab1c143d5f92e50f

Upvotes: 1

Raman Sailopal

Reputation: 12917

As others have suggested, a dedicated HTML/XML parser would be the best solution for this, but if you cannot use one, you can try the following GNU awk solution:

awk -F '[ >]' '{ gsub("data3=\"[[:space:]]+","data3=\"",$0);gsub("[[:space:]]+\"","\"",$0);for (i=1;i<=NF;i++) { if ($i ~ /data3/) { split($i,map,"=");gsub("\"","",map[2]);printf "%s;",map[2] } } }' file

Explanation:

awk -F '[ >]' '{                                                          # Set the field delimiter to a space or ">"
                 gsub("data3=\"[[:space:]]+","data3=\"",$0);              # Remove whitespace right after the opening quote of data3
                 gsub("[[:space:]]+\"","\"",$0);                          # Remove whitespace before a closing quote
                 for (i=1;i<=NF;i++) {                                     # Loop through each field
                   if ($i ~ /data3/) {                                     # Process only fields that contain data3
                     split($i,map,"=");                                    # Split the field into the array map using "=" as the delimiter
                     gsub("\"","",map[2]);                                 # Remove the quotes from the second element of map
                     printf "%s;",map[2]                                   # Print the value followed by ";"
                   }
                 }
               }' file

Upvotes: 0
