Cavyn VonDeylen

Reputation: 4239

Backreference before capture group

I'm trying to match the text Page x of x so I can identify the last page in a document.

I've been experimenting with capture groups and found that the regex Page (\d*) of \1 almost works, except that it also matches things such as Page 2 of 25. Ideally I'd use Page \1 of (\d*), but the regex engine doesn't seem to support a backreference that appears before its capture group.

I should also note that this is part of an OCR job, so I can't rely on string endings, since I occasionally pick up extra characters (Page 2 of 25la, for example).

Anyone have any tips?

Upvotes: 1

Views: 227

Answers (3)

Bohemian

Reputation: 424983

Add a lookahead:

Page (\d*) of \1(?=\D|\Z)

The lookahead succeeds only when the backreference is followed by a non-digit character or by the end of input, so Page 2 of 25 no longer matches.
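As a rough illustration, here is how the pattern above might be tried out in Python. The choice of Python and the helper name is_last_page are assumptions made for this sketch (the question doesn't say which engine is in use); in Python's re module, \Z matches only at the very end of the string.

```python
import re

# The pattern from this answer. Python is assumed here; other engines
# may treat \Z slightly differently (e.g. allow a trailing newline).
LAST_PAGE = re.compile(r"Page (\d*) of \1(?=\D|\Z)")

def is_last_page(text):
    """Hypothetical helper: True if text contains 'Page N of N'
    where the second N is not followed by another digit."""
    return LAST_PAGE.search(text) is not None

for sample in ["Page 2 of 25", "Page 25 of 25la", "Page 25 of 25"]:
    print(sample, "->", is_last_page(sample))
```

The first sample is rejected because the backreference is followed by the digit 5; the other two are accepted, including the one with OCR noise after the number.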

Upvotes: 1

wumpz

Reputation: 9131

But instead of an extra character such as "a" at the end, you could get an extra digit. Then you could be on the last page of your document and the regex would still not match.

Maybe the best way to attack this problem is to start with the simple regexp

Page\s+(\d+)\s+of\s+(\d+)

and iterate over all occurrences to work around the extra-character problem and determine the correct maximum page number. Once it is clear how many pages there are, check where group 1 equals group 2.

I included \s+ in my regexp, which is probably also necessary given your data.

In the end, though, whether this works at all depends on the accuracy of the OCR processing.
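A minimal sketch of the two-pass idea, in Python. Several things here are assumptions rather than part of the answer: the function name find_last_pages, the input being a list of per-page OCR strings, and in particular the heuristic of treating the most frequently seen total as the real page count (one possible way to shrug off an occasional stray digit).

```python
import re
from collections import Counter

PAGE_RE = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)")

def find_last_pages(page_texts):
    """Hypothetical helper: return the indices of the texts whose page
    number equals the agreed total page count."""
    readings = []  # (index, page, total) for every text with a footer
    for index, text in enumerate(page_texts):
        match = PAGE_RE.search(text)
        if match:
            readings.append((index, int(match.group(1)), int(match.group(2))))
    if not readings:
        return []
    # Assumption: the total that OCR reports most often is the true one,
    # so a single misread such as "of 31" instead of "of 3" is outvoted.
    total = Counter(t for _, _, t in readings).most_common(1)[0][0]
    return [index for index, page, _ in readings if page == total]

texts = ["Page 1 of 3", "Page 2 of 3la", "Page 3 of 31"]
print(find_last_pages(texts))
```

Comparing group 1 against the agreed total, rather than against group 2 of the same match, is what lets the third text be recognised as the last page despite the extra digit. A misread in group 1 itself would still defeat it.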

Upvotes: 1

Sabuj Hassan

Reputation: 39355

Use \d+ instead of \d*, so that the page number can't be empty. Also use a lookahead to make sure the digits end there:

Page (\d+) of \1(?=\D)
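To see what the switch to \d+ changes, here is a small Python comparison (Python is an assumption; the names STRICT, LOOSE and END_SAFE are made up for this sketch). It also shows one thing to watch for: (?=\D) needs an actual character after the number, so it won't match when the page number is the very last thing in the input. The (?!\d) variant is not part of this answer; it is just one possible way to cover that case.

```python
import re

STRICT = re.compile(r"Page (\d+) of \1(?=\D)")    # pattern from this answer
LOOSE = re.compile(r"Page (\d*) of \1(?=\D)")     # same, but with \d*
END_SAFE = re.compile(r"Page (\d+) of \1(?!\d)")  # variant: also OK at end of input

samples = ["Page  of x", "Page 25 of 25la", "Page 25 of 25", "Page 2 of 25"]
for sample in samples:
    print(repr(sample),
          "strict:", bool(STRICT.search(sample)),
          "loose:", bool(LOOSE.search(sample)),
          "end-safe:", bool(END_SAFE.search(sample)))
```

With \d*, the group can be empty, so a garbled footer with no numbers at all ("Page  of x") still counts as a match; \d+ rules that out.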

Upvotes: 2
