grep only area in between two strings

Question

I'm having a problem: when trying to grep something on an HTML page (specifically a user name), I try to retrieve the string with:

egrep -o dir\=\"[ltr]*\"\>.*(\<\/span|\<\/a)

By this I am trying to say: "get anything after dir=("ltr" or "rtl")> and before the first </span> or </a> closing tag".

so for example:

dir="ltr">myusername

or

dir="rtl">myusername

There are however multiple span tags on one line, and it is not stopping after the first one, which results in data that I don't want.

Is there a way to modify my current regex to stop after the first one? And why does it even continue reading?

Thanks

tweak2 · Accepted Answer

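You need to make the .* non-greedy by adding a ? to it. By default .* is greedy: it matches as much as it can, so it runs on to the last closing tag on the line. Non-greedy quantifiers are a Perl-style extension that POSIX extended regex (egrep) does not define, so with GNU grep use -P, and quote the pattern so the shell leaves it alone:

grep -oP 'dir="[ltr]*">.*?(</span|</a)'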
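To see the difference between greedy and non-greedy matching, here is a minimal sketch. It assumes GNU grep built with -P (PCRE) support, and the sample line and user names are made up for illustration. The last command is a swapped-in alternative, not part of the answer above: a negated character class, which works in plain extended regex without any lazy quantifier.

```shell
# Hypothetical sample line with two user names on one line.
line='<span dir="ltr">alice</span> <a dir="rtl">bob</a>'

# Greedy: .* runs to the LAST closing tag on the line, so -o prints one long match.
greedy=$(printf '%s\n' "$line" | grep -oP 'dir="(ltr|rtl)">.*(</span|</a)')

# Lazy: .*? stops at the FIRST closing tag, so -o prints one match per user name.
lazy=$(printf '%s\n' "$line" | grep -oP 'dir="(ltr|rtl)">.*?(</span|</a)')

# Alternative without -P: [^<]* cannot cross a "<", so it stops before the next tag.
portable=$(printf '%s\n' "$line" | grep -oE 'dir="(ltr|rtl)">[^<]*')

printf 'greedy:\n%s\n\nlazy:\n%s\n\nportable:\n%s\n' "$greedy" "$lazy" "$portable"
```

The greedy run prints a single match spanning both tags, while the lazy and negated-class runs each print two short matches. The negated-class form only works if the user name itself never contains a "<".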

A better solution is this (in raw regex, you will need to escape it):

dir="[ltr]{3}"[^>]*?>(.*?)(</span>|</a>)

Capture group 1 ($1) will contain the text between the opening and closing tags, and capture group 2 ($2) will tell you whether the closing tag was a span or a link.
