Reputation: 14616
I'm trying to create a regex that can identify sums of money (in dollars). The problem is that the data is generated by OCR on scanned PDF files, so it is not precise:
$ can be represented by S
. can be represented by ,
1 can be represented by l or I
5 can be represented by S
Examples:
Data: What it should be:
S0.01 => $0.01
S1 => $1
S400.05 => $400.05
$0,01 => $0.01
S0,SI => $0.51
Question: Is it possible to construct a regex that can search for such a complex pattern?
Upvotes: 1
Views: 275
Reputation: 531345
It's not that complex. Start with a regular expression that can match "pristine" output, something like
\$[0-9]+(\.[0-9]{2})?
Now, just replace the questionable characters with their alternatives.
[$S][0-9SIl]+([.,][0-9SIl]{2})?
This can give you false positives: you will "find" $1 in a sentence like "I read SI for baseball and basketball news" (SI being an abbreviation for the magazine Sports Illustrated), but that's unavoidable with regular expressions alone.
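As a rough illustration, here is a minimal Python sketch of the tolerant pattern applied to some text. The names MONEY_RE and find_amounts are made up for this example, and the group is written as non-capturing, which does not change what the pattern matches.

```python
import re

# Tolerant pattern from the answer: S may stand in for $,
# and S, I, l may stand in for the digits 5, 1, 1.
MONEY_RE = re.compile(r"[$S][0-9SIl]+(?:[.,][0-9SIl]{2})?")

def find_amounts(text):
    """Return every substring that looks like an OCR'd dollar amount."""
    return [m.group(0) for m in MONEY_RE.finditer(text)]

# Genuine amounts are found even when the OCR has mangled them.
print(find_amounts("Total S400.05 and $0,01 paid"))

# The false positive described above: "SI" is matched as if it were $1.
print(find_amounts("I read SI for baseball news"))
```

The second call shows why the matches still need a sanity check from surrounding context or from a human.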
Once you've made the match, converting the result to its assumed correct form is simple: replace any initial S with $, any , with ., any other S with 5, and any I or l with 1.
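The conversion step could be sketched in Python roughly like this. The function name normalize_amount is hypothetical, and the I and l to 1 replacement is taken from the substitution list in the question. It assumes the input is a string that the tolerant pattern already matched, so the first character is always the currency sign.

```python
def normalize_amount(raw):
    """Convert a matched OCR string such as 'S0,SI' to its assumed form '$0.51'."""
    # The first character is the currency sign, whether it was read as $ or S.
    rest = raw[1:]
    # Every remaining character is a digit, a separator, or an OCR look-alike.
    table = str.maketrans({",": ".", "S": "5", "I": "1", "l": "1"})
    return "$" + rest.translate(table)

for sample in ["S0.01", "S1", "S400.05", "$0,01", "S0,SI"]:
    print(sample, "=>", normalize_amount(sample))
```

Handling the first character separately avoids turning the leading S into a 5.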
Upvotes: 4