Reputation: 2347
I'm trying to grep a string (specifically a user name) out of an HTML page, and I try to retrieve it with:
egrep -o dir\=\"[ltr]*\"\>.*(\<\/span|\<\/a)
By this I am trying to say: "get anything after dir=("ltr" or "rtl")> and before the first </a> or </span> closing tag."
So, for example:
dir="ltr">myusername</span>
or
dir="rtl">myusername</a>
There are however multiple span tags on one line, and it is not stopping after the first one, which results in data that I don't want.
Is there a way to modify my current regex to stop after the first one? And why does it even continue reading?
Thanks
Upvotes: 1
Views: 1010
Reputation: 54392
I would use GNU sed to do this:
sed -r 's/(dir="ltr"|dir="rtl")>([^<]+)(<\/span>|<\/a>).*/\2/' file.txt
You can make the regex a bit more clever and easier to read with some simplification:
sed -r 's/dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/' file.txt
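Note that the substitutions above leave any text before dir= in place and print non-matching lines unchanged. As a rough sketch of one way around that (not part of the original answer), you could let grep -o isolate each match first and then strip the wrapper with sed. The function name extract_usernames is made up for illustration, and -E is used instead of -r on the assumption that your sed accepts it (GNU and BSD sed both do).

```shell
# Hypothetical helper: print every user name found on stdin, one per line.
extract_usernames() {
  # Step 1: -o prints each match on its own line; [^<]+ cannot run past
  # the next tag, so a match always ends at the first closing tag.
  grep -oE 'dir="(ltr|rtl)">[^<]+</(span|a)>' |
    # Step 2: strip the leading attribute and the trailing closing tag.
    sed -E 's/^dir="(ltr|rtl)">//; s/<\/(span|a)>$//'
}
```

Used as `extract_usernames < file.txt`, this handles several spans on one line because each match is emitted separately before sed ever sees it.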
Upvotes: 0
Reputation: 656
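Your .* is greedy: it matches as much as it can, so it runs on to the last closing tag on the line. You need to make the .* non-greedy by adding a ? to it. Lazy quantifiers are a Perl-style feature that POSIX extended regex (egrep) does not support, so with GNU grep use -P:
grep -oP 'dir="[ltr]*">.*?(</span|</a)'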
A better solution is this (in raw regex, you will need to escape it):
dir="[ltr]{3}"[^>]*?>(.*?)(</span>|</a>)
Capture group 1 ($1) will contain the text between the tags, and capture group 2 ($2) will contain the closing tag, which tells you whether it was a span or a link.
See it in action: http://regexr.com?32b8k
Upvotes: 2