Reputation: 43
I'm trying to extract a specific piece of information from a website, but the value I want also appears inside the tag's attributes:
<div class="some_div_class">
<strong content="999" itemprop="price" class="strong_class">
999
</strong>
</div>
I'm targeting the "999", which I can extract with:
curl -s url | grep -zPo '<strong content="999" itemprop="price" class="strong_class">\s*\K.*?(?=\s*</strong>)'
However, the "999" is also hard-coded in the content attribute of my pattern, so as soon as the price changes the grep no longer matches. I tried replacing it with wildcards, but that returned nothing.
Upvotes: 0
Views: 803
Reputation: 3443
Please(!) have a look at the following URLs before you attempt to parse a website with regex:
With an HTML/XML parser like xidel it's as simple as:
xidel -s "<url or file>" -e '//div[@class="some_div_class"]/strong/@content'
or
xidel -s "<url or file>" -e '//div[@class="some_div_class"]/normalize-space(strong)'
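To try both queries without hitting a live site, something like the following should work. It is only a sketch: it assumes xidel is installed and on your PATH, and the temporary file and the variable names (sample, price_attr, price_text) are placeholders I made up for the illustration. The markup is the snippet from the question.

```shell
#!/bin/sh
# Sample markup copied from the question; a real run would pass a URL instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="some_div_class">
<strong content="999" itemprop="price" class="strong_class">
999
</strong>
</div>
EOF

if command -v xidel >/dev/null 2>&1; then
    # Read the price from the content attribute of <strong>.
    price_attr=$(xidel -s "$sample" -e '//div[@class="some_div_class"]/strong/@content')
    # Read the price from the element text, with surrounding whitespace trimmed.
    price_text=$(xidel -s "$sample" -e '//div[@class="some_div_class"]/normalize-space(strong)')
    printf 'attribute: %s\ntext: %s\n' "$price_attr" "$price_text"
else
    # Assumption: xidel is a separate install, so report its absence and skip.
    echo "xidel is not installed" >&2
fi

rm -f "$sample"
```

Neither query mentions the price itself, so they keep working when the value changes. That is the part the hard-coded grep pattern could not do.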
Upvotes: 2