Reputation: 858
I have the following HTML structure and want to extract data from it using awk.
<body>
<div>...</div>
<div>...</div>
<div class="body-content">
<div>...</div>
<div class="product-list" class="container">
<div class="w3-row" id="product-list-row">
<div class="w3-col m2 s4">
<div class="product-cell">
<div class="product-title">Product A</div>
<div class="product-price">100,56</div>
</div>
</div>
<div class="w3-col m2 s4">
<div class="product-cell">
<div class="product-title">Product B</div>
<div class="product-price">200,56</div>
</div>
</div>
<div class="w3-col m2 s4">
<div class="product-cell">
<div class="product-title">Product C</div>
<div class="product-price">300,56</div>
</div>
</div>
<div class="w3-col m2 s4">
<div class="product-cell">
<div class="product-title">Product D</div>
<div class="product-price">400,56</div>
</div>
</div>
</div>
</div>
</div>
</body>
The result I want to have is as follows.
100,56
200,56
300,56
400,56
I was experimenting with the following awk script (I know it makes no sense to select product-price twice; I was about to modify this script):
awk -F '<[^>]+>' 'found { sub(/^[[:space:]]*/,";"); print title $0; found=0 } /<div class="product-price">/ { title=$2 } /<div class="product-price">/ { found=1 }'
but it gives me the result
100,56 </div>
200,56 </div>
300,56 </div>
400,56 </div>
I have never used awk before, so I can't figure out what is wrong here or how to modify the above code. How would you do this?
Upvotes: 2
Views: 2917
Reputation: 3423
It baffles me that, time and time again, people try to parse HTML not with an HTML parser but with a tool that doesn't understand HTML at all, and with regular expressions in particular!
With an HTML parser like xidel it's as simple as:
$ xidel -s "<url> or input.html" -e '//div[@class="product-price"]'
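For readers without xidel, the same "use a real parser" idea can be sketched with Python's standard-library html.parser module. This is only an illustration: the PriceExtractor class, the extract_prices helper and the shortened sample string are made up for this example, and the sketch assumes the price div contains plain text with no nested div elements.

```python
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect the text of every <div class="product-price"> element."""

    def __init__(self):
        super().__init__()
        self.in_price = False  # True while inside a product-price div
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; class may hold several names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "product-price" in classes:
            self.in_price = True

    def handle_endtag(self, tag):
        # Assumption: price divs have no nested divs, so any </div> closes them
        if tag == "div":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price and data.strip():
            self.prices.append(data.strip())


def extract_prices(html):
    """Hypothetical helper: return the price strings found in html."""
    parser = PriceExtractor()
    parser.feed(html)
    parser.close()
    return parser.prices


# Shortened stand-in for the HTML in the question
sample = """
<div class="product-cell">
  <div class="product-title">Product A</div>
  <div class="product-price">100,56</div>
</div>
<div class="product-cell">
  <div class="product-title">Product B</div>
  <div class="product-price">200,56</div>
</div>
"""

if __name__ == "__main__":
    print("\n".join(extract_prices(sample)))
```

Because the parser tracks tags rather than lines, this does not depend on how the HTML happens to be indented or wrapped.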
Upvotes: 2
Reputation: 36370
How would you do this?
If possible, use a tool designed for dealing with HTML, which GNU AWK is not.
If you are allowed to install software, use hxselect. It processes standard input and understands a subset of CSS selectors, so in this case something like:
hxselect -i -c -s '\n' div.product-price < file.html
should give you the desired result (disclaimer: I am not able to test it)
Upvotes: 2
Reputation: 133428
If someone is looking for a Python solution, I would suggest using the BeautifulSoup library. The following was written and tested in Python 3.8. To keep it separate from my previous answer, I am adding another answer here.
#!/bin/python3
## Import the library.
from bs4 import BeautifulSoup
## Read Input_file and get all of its contents.
with open('Input_file', 'r') as f:
    contents = f.read()
## Parse the contents with the lxml parser into the soup variable.
soup = BeautifulSoup(contents, 'lxml')
## Get only the div elements with the class the OP needs.
vals = soup.find_all("div", {"class": "product-price"})
## Print the text inside the tags.
for val in vals:
    print(val.text)
NOTE: lxml needs to be installed with pip3 or pip, depending upon your system.
Upvotes: 3
Reputation: 133428
With your shown samples/attempts, please try the following awk code.
awk -F"[><]" '{gsub(/\r/,"")} /^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{print $3}' Input_file
Explanation: a detailed breakdown of the above. This version is for explanation only; to run the code, please use the one above.
awk -F"[><]" '                 ##Start the awk program and set the field separator to > or <.
{gsub(/\r/,"")}                ##Remove carriage returns (Control-M characters) from the line.
/^[ \t]+<div[ \t]+class="product-price">.*<\/div>/{   ##Check whether the line starts with whitespace
                               ##followed by <div class="product-price"> and has a closing div tag.
  print $3                     ##Print the 3rd field.
}
' Input_file                   ##Mention the Input_file name here.
Changed the regex to /^[ \t]+<div[ \t]+class as per Ed's suggestions in the comments. Also, experts always recommend using xmlstarlet or other XML-aware tools if they are available on your system.
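To see why the price ends up in $3, it may help to reproduce the field splitting outside awk. The following Python sketch mirrors what -F"[><]" does to one line; the sample line, including its leading indentation and trailing carriage return, is an assumption made for illustration.

```python
import re

# Assumed sample line: indented, with a Windows-style trailing carriage return
line = '    <div class="product-price">100,56</div>\r'

# Mirror gsub(/\r/,""): drop carriage returns
line = line.replace("\r", "")

# Mirror -F"[><]": split the line at every < or >
fields = re.split(r"[><]", line)

# awk numbers fields from 1, Python lists from 0, so awk's $3 is fields[2]
for number, field in enumerate(fields, start=1):
    print(f"${number} = {field!r}")

price = fields[2]
```

The leading whitespace becomes $1, the opening tag's contents become $2, and the text between the tags becomes $3, which is the value the awk command prints.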
Upvotes: 3
Reputation: 203209
The result of a quick google for "xmlstarlet print div contents" and then a few secs of trial and error:
$ xmlstarlet sel -t -m "//*[@class='product-price']" -v "." -n file
100,56
200,56
300,56
400,56
For an explanation - ask google :-).
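As a brief explanation: sel selects data, -t starts a template, -m matches every node for the given XPath, -v "." prints the value of each matched node, and -n adds a newline. A rough Python counterpart using the standard library's xml.etree.ElementTree, which supports a limited XPath subset, might look like the sketch below. The shortened sample string is an assumption, and the input must be well-formed XML (a tag with a duplicated attribute, for instance, would be rejected by the parser).

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the HTML in the question
sample = """<body>
  <div class="product-cell">
    <div class="product-title">Product A</div>
    <div class="product-price">100,56</div>
  </div>
  <div class="product-cell">
    <div class="product-title">Product B</div>
    <div class="product-price">200,56</div>
  </div>
</body>"""

root = ET.fromstring(sample)

# Same idea as //*[@class='product-price']: any element, at any depth,
# whose class attribute is exactly 'product-price'
prices = [el.text for el in root.iterfind(".//*[@class='product-price']")]

print("\n".join(prices))
```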
Upvotes: 3