Test45

Reputation: 593

Regex to extract http links from an XML file

I have an XML file with many lines like:

<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />

How do I extract just the link - http://store.vcenter.com/stores/en/product/tigers-midi/100?

I tried http://www\.\.com[^<]+ but that captures everything until the end of the line, including the quotes and the closing XML tag.

I'm using this expression with egrep.

Upvotes: 0

Views: 624

Answers (1)

Gilles Quénot

Reputation: 185284

Don't parse HTML with regex; use a proper XML/HTML parser.

Check: Using regular expressions with HTML tags

You can use one of the following:

  • xmllint
  • xmlstarlet
  • saxon-lint

File:

<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>

Example with xmllint:

xmllint --xpath '//*[@vip="true"]/@href' file.xml 2>/dev/null

Output:

 href="http://store.vcenter.com/stores/en/product/tigers-midi/100"
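
If you only want the bare URL without the href="..." wrapper, one option is to wrap the XPath in string(). This is a minimal sketch, assuming xmllint is installed and the sample above is saved as file.xml. Note that string() returns the value of the first matching attribute only, and the variable name url is just for illustration.

```shell
# Recreate the sample file shown above (file.xml is the name used in this answer).
cat > file.xml <<'EOF'
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
EOF

# string() converts the first matching @href node to its text value.
# 2>/dev/null hides the complaint about the undeclared xhtml: namespace prefix.
url=$(xmllint --xpath 'string(//*[@vip="true"]/@href)' file.xml 2>/dev/null)
printf '%s\n' "$url"
```

For many links in one file, the plain @href query above still lists every match; string() is only convenient when a single value is expected.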

If you need a quick and dirty one-time command, you can do:

egrep -o 'https?://[^"]+' file
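
The reason this works where [^<]+ did not: there is no < after the URL on that line, so [^<]+ runs to the end of the line, while [^"]+ stops at the closing quote. A small sketch of that behaviour, assuming a throwaway file named sample.xml (a hypothetical name) and using grep -E, which is equivalent to egrep:

```shell
# Build a one-line sample mirroring the input from the question.
cat > sample.xml <<'EOF'
<root>
<xhtml:link vip="true" href="http://store.vcenter.com/stores/en/product/tigers-midi/100" />
</root>
EOF

# -o prints only the matched text; [^"]+ ends the match just before the closing quote.
links=$(grep -Eo 'https?://[^"]+' sample.xml)
printf '%s\n' "$links"
```

Recent GNU grep releases print an obsolescence warning for egrep, so grep -E is probably the safer spelling in scripts.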

Upvotes: 2
