Reputation: 559
I need to extract data from an HTML document and compose an XML document containing only the information I'm interested in. The way I'm doing this is by transforming the HTML doc into an XML doc, step by step. I have each of the 5 outermost XML tags on one line, and now I'm trying to structure what's inside of those.
I have a line that's structured this way:
<myTag>
blablabla <a href="link/I/want" *some css* > title I want </a> some other stuff <a href="link that/I/don't/want" *some css*> text I don't want </a> blablabla
</myTag>
What I want is:
<myTag>
<link>link/I/want</link>
<title> title I want </title>
</myTag>
The regex I have is:
/a href="(.*)"(.*)>(.*)<\/a>/
hoping to get $1 = url, $2 = whatever, $3 = title.
This isn't working because it captures this instead:
<myTag>
<link>link/I/want *some css* > title I want </a> some other stuff <a href="link that/I/don't/want" *some css*</link>
<title>text I don't want</title>
</myTag>
How do I extract the content of the FIRST anchor tag of the line?
Thanks!
Upvotes: 1
Views: 168
Reputation: 64623
Just use non-greedy expressions:
/a href="(.*?)"(.*?)>(.*?)<\/a>/
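Note the ? after each *. It makes the quantifier match as few characters as possible, so each group stops at the first delimiter that follows it instead of the last one on the line.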
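To see the difference on the sample line from the question, here is a minimal sketch in Python (an assumption, since the question does not say which language the regex is run in). The helper name first_anchor is hypothetical, and the title is stripped of surrounding whitespace for readability.

```python
import re

# The sample line from the question, copied as-is.
line = ('blablabla <a href="link/I/want" *some css* > title I want </a> '
        'some other stuff <a href="link that/I/don\'t/want" *some css*> '
        'text I don\'t want </a> blablabla')

# Greedy version: each .* grabs as much as it can, then backtracks.
greedy = re.compile(r'a href="(.*)"(.*)>(.*)</a>')
# Non-greedy version: each .*? stops at the first place the rest can match.
lazy = re.compile(r'a href="(.*?)"(.*?)>(.*?)</a>')

def first_anchor(pattern, text):
    """Return (url, title) from the first match of pattern, or None."""
    m = pattern.search(text)
    if m is None:
        return None
    return m.group(1), m.group(3).strip()

greedy_url, greedy_title = first_anchor(greedy, line)
url, title = first_anchor(lazy, line)

print("greedy:", greedy_url, "|", greedy_title)
print("lazy:  ", url, "|", title)

# Compose the XML fragment the question asks for.
xml = "<myTag>\n<link>{}</link>\n<title>{}</title>\n</myTag>".format(url, title)
print(xml)
```

With the greedy pattern, the first group runs up to the last double quote on the line, so it swallows the first anchor and most of the second. The non-greedy pattern stops at the first closing quote and the first closing anchor tag, which is what the question wants.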
Upvotes: 3