Sam

Reputation: 2347

grep only the area between two strings

I'm trying to grep something out of an HTML page (specifically a user name). I try to retrieve the string with:

egrep -o dir\=\"[ltr]*\"\>.*(\<\/span|\<\/a)

By this I am trying to say: get anything after dir="ltr"> or dir="rtl"> and before the first </a> or </span> closing tag.

So, for example:

dir="ltr">myusername</span>

or

dir="rtl">myusername</a>

There are however multiple span tags on one line, and it is not stopping after the first one, which results in data that I don't want.

Is there a way to modify my current regex to stop after the first one? And why does it even continue reading?

Thanks

Upvotes: 1

Views: 1010

Answers (2)

Steve

Reputation: 54392

I would use GNU sed to do this:

sed -r 's/(dir="ltr"|dir="rtl")>([^<]+)(<\/span>|<\/a>).*/\2/' file.txt

The regex becomes shorter and easier to read if you factor out the common parts:

sed -r 's/dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/' file.txt
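As a rough check, the simplified command can be run against a sample line. The sketch below assumes a made-up input line (the real page markup is not shown in the question) and hypothetical variable names. It uses -E, which GNU sed treats the same as -r. It also shows one caveat: the substitution as written leaves any text before dir= in place, so a variant with a leading .* and -n ... p is included, assuming there is only one dir= attribute of interest per line.

```shell
# Hypothetical sample line; the real page markup is not shown in the question.
line='<li><span dir="ltr">myusername</span><span class="x">other</span></li>'

# Substitution as given in the answer: everything from dir= to the end of the
# line is replaced by group 2, but the text before dir= is left untouched.
as_written=$(printf '%s\n' "$line" |
  sed -E 's/dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/')

# Variant (an assumption, not part of the answer): a leading .* consumes the
# prefix, and -n with the p flag prints only lines where the pattern matched.
name_only=$(printf '%s\n' "$line" |
  sed -E -n 's/.*dir="(ltr|rtl)">([^<]+)<\/(span|a)>.*/\2/p')

printf 'as written: %s\nname only:  %s\n' "$as_written" "$name_only"
```

Because the leading .* is greedy, the variant picks the last dir= attribute if a line contains several.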

Upvotes: 0

tweak2

Reputation: 656

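It keeps reading because .* is greedy: it matches as much as it can, so the match extends to the last </span or </a on the line. You need to make the .* non-greedy by adding a ? to it. Note that the POSIX extended syntax used by egrep has no non-greedy quantifiers, so this needs a PCRE-capable grep, for example GNU grep with -P:

grep -oP 'dir="[ltr]*">.*?(</span|</a)'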

A better solution is this (in raw regex, you will need to escape it):

dir="[ltr]{3}"[^>]*?>(.*?)(</span>|</a>)

Capture group 1 ($1) will contain the text between the tags, and capture group 2 ($2) will contain the closing tag, which tells you whether it was a span or a link.

See it in action: http://regexr.com?32b8k
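If a PCRE-capable grep is not available, a negated character class gives the same "stop at the first closing tag" effect in plain extended regex. The following is a minimal sketch under assumptions: the sample line and variable names are made up, and the user name is assumed to contain no < character. It relies on the -o option, which GNU and BSD grep both provide.

```shell
# Hypothetical sample line with two spans, as described in the question.
line='<span dir="ltr">myusername</span><span class="x">other</span>'

# Greedy: .* matches as much as possible, so the match ends at the LAST
# closing tag on the line.
greedy=$(printf '%s\n' "$line" |
  grep -oE 'dir="(ltr|rtl)">.*(</span|</a)')

# Negated class: [^<]* cannot cross a "<", so the match ends at the FIRST
# closing tag after the attribute.
first=$(printf '%s\n' "$line" |
  grep -oE 'dir="(ltr|rtl)">[^<]*(</span|</a)')

printf 'greedy: %s\nfirst:  %s\n' "$greedy" "$first"
```

Since grep -o prints the whole match rather than a capture group, the dir="..."> prefix and the closing-tag suffix still have to be stripped afterwards, for example with sed.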

Upvotes: 2
