Haabda

Reputation: 1433

Why does this RegEx work the way I want it to?

I have a RegEx that is working for me but I don't know WHY it is working for me. I'll explain.

RegEx: \s*<in.*="(<?.*?>)"\s*/>\s*


Text it finds (it finds the white-space before and after the input tag):

<td class="style9">
      <input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value="<?php echo $data[guarantor4]; ?>"  />    </td>
</tr>


The part I don't understand:

<in.*=" <--- As I understand it, this should only find up to the first =" as in it should only find <input name="

It actually finds: <input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value=" which happened to be what I was trying to do.

What am I not understanding about this RegEx?

Upvotes: 6

Views: 244

Answers (5)

regeXGuru4

Reputation: 21

Greedy matching is what's causing the confusion; you want .*? instead. Consider the input 101000000000100.

With 1.*1, the * is greedy: .* first consumes everything to the end of the input, then backtracks until a 1 can match, leaving you with 1010000000001. With 1.*?1, the *? is non-greedy: .*? starts by matching nothing, then takes one extra character at a time until a 1 can match, ending up with 101.
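
To see the difference concretely, here is a minimal sketch in Python (the question doesn't name a language, so Python's re module is an assumption; most backtracking regex engines should behave the same way on this input):

```python
import re

text = "101000000000100"

# Greedy: .* grabs everything, then backtracks to the LAST 1 it can end on.
greedy = re.search(r"1.*1", text).group()

# Non-greedy: .*? grows one character at a time and stops at the FIRST 1.
lazy = re.search(r"1.*?1", text).group()

print(greedy)  # 1010000000001
print(lazy)    # 101
```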

Upvotes: 2

Ross Patterson

Reputation: 9570

As I understand it, this should only find up to the first =" as in it should only find <input name="

You don't say what language you're writing in, but in almost all regular expression systems quantifiers such as * are "greedy" by default - that is, they match as much of the input as they can while still letting the rest of the pattern match. In your case, that means everything from the start of the input tag to the last equals-quote sequence.

Most regex systems have a way to specify that the pattern match only the shortest possible substring, not the longest - "non-greedy matching".
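
As a rough illustration, the sketch below runs the fragment from the question against the input tag in both modes, and then the full original pattern. Python is an assumption here (the question doesn't say which engine is in use), and the variable names are made up for the example:

```python
import re

# The input tag from the question, as a single line.
tag = '<input name="guarantor4" id="guarantor4" size="50" type="text" tabindex="10" value="<?php echo $data[guarantor4]; ?>"  />'

# Greedy: runs to the LAST =" in the tag.
greedy = re.match(r'<in.*="', tag).group()

# Non-greedy: stops at the FIRST =" in the tag.
lazy = re.match(r'<in.*?="', tag).group()

# The full pattern from the question; group 1 is the captured value.
full = re.search(r'\s*<in.*="(<?.*?>)"\s*/>\s*', tag)

print(greedy)         # ... ends with value="
print(lazy)           # <input name="
print(full.group(1))  # <?php echo $data[guarantor4]; ?>
```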

As an aside, don't assume the first parameter will be name= unless you have full control over the construction of the input. Both HTML and XML allow attributes to be specified in any order.

Upvotes: 3

Kent Fredric

Reputation: 57354

You appear to be using 'greedy' matching.

Greedy matching says "eat as much as possible to make this work"

try with

<in[^=]*=  

for starters, that will stop it matching the "=" as part of ".*"

but in future, you might want to read up on the

.*?  

and

.+?

notation, which stops at the first possible condition that matches instead of the last.

The 'non-greedy' syntax is the better choice when you only want to stop at a sequence of two or more characters,

ie:

<in.*?=id

which would stop on the first '=id' regardless of whether or not there are '=' in between.
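
A small sketch comparing the two approaches, assuming Python's re module. The sample string is made up for illustration (the tag in the question doesn't contain '=id'):

```python
import re

# Hypothetical input, chosen so that '=id' appears more than once.
sample = "<input a=1 b=id c=id>"

# Negated character class: cannot cross an '=', so it stops at the first one.
negated = re.match(r"<in[^=]*=", sample).group()

# Non-greedy: may cross earlier '=' characters, stops at the first '=id'.
lazy = re.match(r"<in.*?=id", sample).group()

# Greedy, for contrast: runs to the last '=id'.
greedy = re.match(r"<in.*=id", sample).group()

print(negated)  # <input a=
print(lazy)     # <input a=1 b=id
print(greedy)   # <input a=1 b=id c=id
```

The negated class only works when the stop condition is a single character; for a longer delimiter like '=id' the non-greedy form is the simpler option.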

Upvotes: 8

Stavros Korokithakis

Reputation: 4956

.* is greedy, so it'll find up to the last =. If you want it non-greedy, add a question mark, like so: .*?

Upvotes: 4

eyelidlessness

Reputation: 63529

.* is greedy. You want .*? to find up to only the first =.

Upvotes: 8
