Reputation: 593
I have an XML file with many lines like:
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
How do I extract just the link (http://store.vcenter.com/stores/en/product/tigers-midi/100)?
I tried http://www\.\.com[^<]+
but that captures everything until the end of the line, including the quotes and the closing XML tag.
I'm using this expression with egrep.
Upvotes: 0
Views: 624
Reputation: 185284
Don't parse HTML with regex; use a proper XML/HTML parser.
Check: Using regular expressions with HTML tags
You can use one of the following:
xmllint
xmlstarlet
saxon-lint
File:
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
Example with xmllint:
xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null
Output:
href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
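The output above still carries the href="..." wrapper, while the question asks for the bare link. A minimal sketch of one way to strip it, assuming the attribute value contains no double quotes; strip_href is a hypothetical helper name, and the printf line only stands in for the xmllint output:

```shell
# Hypothetical helper: reads xmllint-style attribute output on stdin and
# prints only the quoted value of each href="..." occurrence, one per line.
strip_href() {
  grep -oE 'href="[^"]*"' | sed -E 's/^href="(.*)"$/\1/'
}

# Stand-in for the xmllint output shown above (leading space included).
value=$(printf ' href="http://store.vcenter.com/stores/en/product/tigers-midi/100"\n' | strip_href)
printf '%s\n' "$value"
```

In practice the helper would sit at the end of the pipeline: xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null | strip_href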
If you need a quick-and-dirty one-off command, you can do:
egrep -o 'https?://[^"]+' file
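To see why this works where the pattern from the question did not, here is a small sketch comparing the two character classes on the sample line. The temporary file and the variable names are assumptions made for illustration, and grep -E is used in place of egrep (the two are equivalent, and recent GNU grep releases warn that egrep is obsolescent):

```shell
# Sample line from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
EOF

# [^<]+ only stops at "<", so the match runs through the closing quote and " />".
greedy=$(grep -oE 'http://[^<]+' "$sample")

# [^"]+ stops at the first double quote, which is where the attribute value ends.
url=$(grep -oE 'https?://[^"]+' "$sample")

printf 'with [^<]+: %s\n' "$greedy"
printf 'with [^"]+: %s\n' "$url"
rm -f "$sample"
```

The only change that matters is the negated character class: excluding the double quote ends the match at the attribute boundary, which is all that a line-oriented tool can rely on here.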
Upvotes: 2