Reputation: 605
I am trying to write a regular expression in Perl that works on text files that are a mix of text and account numbers. What I would like to do is reformat the account numbers. I am running into an issue with using .* to match on either side of the account numbers when there is more than one match on a given line. I have done some searching and couldn't find any answers, so I am hoping that someone can explain what is wrong with my regex so I can avoid this pitfall in the future.
while(<>) {
s/(.*)\b([0-9]+)\b(.*)/$1xxx\-$2$3/g;
print;
}
The xxx- will eventually be replaced by account identifiers, but until I get it working I just have x's.
The issue I am having is that only the last occurrence gets replaced, not all occurrences.
For example with a simple sample line:
First Part 223456 Third Part Fourth Part 113456 Fifth Part Sixth Part
I would expect:
First Part xxx-223456 Third Part Fourth Part xxx-113456 Fifth Part Sixth Part
But I only get:
First Part 223456 Third Part Fourth Part xxx-113456 Fifth Part Sixth Part
I have narrowed it down to the .* as the issue: if I use other metacharacters in the capturing groups it works, but I have no guarantees as to what's in the files, so I need to match everything. It only happens when there are multiple account numbers on the same line; if the account numbers appear on separate lines it works fine.
Any feedback would be greatly appreciated.
Upvotes: 0
Views: 121
Reputation: 29854
(.*) consumes all the characters in the line. It then has to start backtracking: it gives one character back and tests whether the next part of the pattern matches, and if not, it gives back another character and checks again, character by character.
So by putting a greedy universal match as your first expression, you're actually asking the engine to only find the last match. You might not have known you were asking for this, but you were.
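To see the result of that backtracking concretely, here is a minimal sketch that runs the pattern from the question once against the sample line and prints what each group captured. The helper name captures is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the three captures of the question's pattern,
# or an empty list if the pattern does not match.
sub captures {
    my ($line) = @_;
    return $line =~ /(.*)\b([0-9]+)\b(.*)/ ? ($1, $2, $3) : ();
}

my $sample = 'First Part 223456 Third Part Fourth Part 113456 Fifth Part Sixth Part';
my ($before, $number, $after) = captures($sample);

print "before: [$before]\n";  # before: [First Part 223456 Third Part Fourth Part ]
print "number: [$number]\n";  # number: [113456]
print "after:  [$after]\n";   # after:  [ Fifth Part Sixth Part]
```

The first number ends up inside the leading group, and the single match already spans the whole line, so the next /g attempt starts at the end of the line and finds nothing.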
Generally, when writing a regex, you have to think about the data: "How would I identify this pattern in a file?" Very likely, "one-or-more digits" just doesn't cut it for an account number, so specify the pattern you want to match to the best of your ability. Then you can be sure that if something matches your pattern, it is likely what you want. By the way, the word boundary specification was a good start.
If you need exactly six digits, then specify exactly six digits.
Another reason that you should not have to specify (.*) as part of a match is that, from the looks of it, you are doing what you think you need to do to keep the other parts of the line in their place. But Perl only replaces the matched section with the replacement. You never need to specify anything but the part you want matched.
So, assuming that your account numbers are 6 digits wide, this is all you need.
s/\b(\d{6})\b/xxx-$1/g;
One last point. Even if your regex had found the first match, specifying (.*) after the pattern would guarantee that you only find one match per line, and the /g would have no effect, because the trailing (.*) makes the match run to the end of the line.
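A quick way to check this is to count replacements, since s///g returns the number of substitutions it made. The sketch below uses two hypothetical helpers (the names are mine, not from the question): one keeps a trailing (.*) and the other matches only the number.

```perl
use strict;
use warnings;

# Hypothetical helper: number followed by a trailing (.*), as discussed above.
sub with_trailing_dotstar {
    my ($text) = @_;
    my $count = ($text =~ s/\b([0-9]+)\b(.*)/xxx-$1$2/g) || 0;
    return ($text, $count);
}

# Hypothetical helper: match only the number and let Perl keep the rest.
sub number_only {
    my ($text) = @_;
    my $count = ($text =~ s/\b([0-9]+)\b/xxx-$1/g) || 0;
    return ($text, $count);
}

my $sample = 'First Part 223456 Third Part Fourth Part 113456 Fifth Part Sixth Part';

my ($greedy_text, $greedy_count) = with_trailing_dotstar($sample);
print "$greedy_count: $greedy_text\n";
# 1: First Part xxx-223456 Third Part Fourth Part 113456 Fifth Part Sixth Part

my ($plain_text, $plain_count) = number_only($sample);
print "$plain_count: $plain_text\n";
# 2: First Part xxx-223456 Third Part Fourth Part xxx-113456 Fifth Part Sixth Part
```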
Upvotes: 2
Reputation: 4088
The problem I see is the greedy match (.*), which in your case will match everything up to the last number ([0-9]+ at a word boundary). I think you can just turn greediness off and you should be fine (e.g. s/(.*?)//g).
Here's a small example:
while(my $line = <$fh>) {
$line =~ s/(.*?)\b([0-9]+)\b(.*?)/$1xxx\-$2$3/g;
print $line;
}
OUTPUT:
First Part xxx-223456 Third Part Fourth Part xxx-113456 Fifth Part Sixth Part
First Part xxx-223456 Third Part Fourth Part xxx-113456 Fifth Part Sixth Part
First Part xxx-223456 Third Part Fourth Part
First Part xxx-223456
Upvotes: 1
Reputation: 36272
One way, using a negative look-behind and a positive look-ahead:
perl -pe 's/(?<!\d)(\d+)(?=\D|$)/xxx-$1/g' <<<"First Part 223456 Third Part Fourth Part 113456 Fifth Part Sixth Part"
It yields:
First Part xxx-223456 Third Part Fourth Part xxx-113456 Fifth Part Sixth Part
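On the sample line this gives the same result as a \b-based pattern. The two differ when a number is glued to a letter or underscore, because \b needs a word/non-word transition while the look-arounds only check for neighbouring digits. Here is a small sketch of that difference; the helper names and the second test string are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: word-boundary version.
sub tag_with_boundaries {
    my ($text) = @_;
    $text =~ s/\b(\d+)\b/xxx-$1/g;
    return $text;
}

# Hypothetical helper: look-around version from the one-liner above.
sub tag_with_lookarounds {
    my ($text) = @_;
    $text =~ s/(?<!\d)(\d+)(?=\D|$)/xxx-$1/g;
    return $text;
}

my $plain = 'Part 223456 end';
my $glued = 'Part ID223456 end';   # made-up input: digits directly after letters

print tag_with_boundaries($plain),  "\n";  # Part xxx-223456 end
print tag_with_lookarounds($plain), "\n";  # Part xxx-223456 end
print tag_with_boundaries($glued),  "\n";  # Part ID223456 end (no boundary between D and 2)
print tag_with_lookarounds($glued), "\n";  # Part IDxxx-223456 end
```

Which behaviour is right depends on what the account numbers look like in the real files.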
Upvotes: 0
Reputation: 1005
If account numbers are just going to be digits, just do this:
s/\b(\d+)\b/xxx-$1/g;
And if they will always be 6 digits, be more specific: s/\b(\d{6})\b/xxx-$1/g;
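To show why the more specific pattern can matter, here is a small sketch comparing the two on a line that also contains other numbers. The sample line and the helper names are made up, and it assumes account numbers really are exactly six digits.

```perl
use strict;
use warnings;

# Hypothetical helper: tags every standalone run of digits.
sub tag_any_digits {
    my ($text) = @_;
    $text =~ s/\b(\d+)\b/xxx-$1/g;
    return $text;
}

# Hypothetical helper: tags only standalone six-digit numbers.
sub tag_six_digits {
    my ($text) = @_;
    $text =~ s/\b(\d{6})\b/xxx-$1/g;
    return $text;
}

my $sample = 'Page 2 of 10: account 223456, ref 1234567';

print tag_any_digits($sample), "\n";
# Page xxx-2 of xxx-10: account xxx-223456, ref xxx-1234567

print tag_six_digits($sample), "\n";
# Page 2 of 10: account xxx-223456, ref 1234567
```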
Upvotes: 2