Vingtoft

Reputation: 14616

How to construct regex to identify dollar ($) money sum

I'm trying to create a regex that can identify sums of money (in dollars). The problem is that the data is generated by OCR on scanned PDF files, so it is not precise:

Examples:

Data:       What it should be:
S0.01    => $0.01
S1       => $1
S400.05  => $400.05
$0,01    => $0.01
S0,SI    => $0.51

Question: Is it possible to construct a regex that can search for such a complex pattern?

Upvotes: 1

Views: 275

Answers (1)

chepner

Reputation: 531345

It's not that complex. Start with a regular expression that can match "pristine" output, something like

\$[0-9]+(\.[0-9]{2})?

Now, just replace the questionable characters with their alternatives.

[$S][0-9SIl]+([.,][0-9SIl]{2})?
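As a quick check, here is a minimal Python sketch that runs the pattern above against the sample strings from the question. The names OCR_MONEY and samples are just illustrative, and Python is only one possible choice since the question does not name a language.

```python
import re

# Pattern from the answer: OCR may read "$" as "S", a digit as "S", "I" or "l",
# and the decimal point as ",".
OCR_MONEY = re.compile(r"[$S][0-9SIl]+([.,][0-9SIl]{2})?")

# Sample OCR output taken from the question.
samples = ["S0.01", "S1", "S400.05", "$0,01", "S0,SI"]

for text in samples:
    # fullmatch requires the whole string to fit the pattern.
    print(text, "->", bool(OCR_MONEY.fullmatch(text)))

# The false positive: "SI" inside ordinary prose also matches.
print(OCR_MONEY.search("I read SI for baseball and basketball news").group(0))
```

When scanning running text rather than isolated tokens, search or finditer would be used instead of fullmatch.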

This can give you false positives: you will "find" $1 in a sentence like "I read SI for baseball and basketball news" (SI being an abbreviation for the magazine Sports Illustrated), but that's unavoidable with regular expressions alone.

Once you've made the match, converting the result to its assumed correct form is simple: replace any initial S with $, any , with ., any other S with 5, and any I or l with 1.
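The conversion step could be sketched in Python roughly as follows. The function names normalize and find_amounts are hypothetical, and the mapping of I and l to 1 is an assumption inferred from the question's S0,SI => $0.51 example and the character class in the pattern.

```python
import re

# Same OCR-tolerant pattern as in the answer.
OCR_MONEY = re.compile(r"[$S][0-9SIl]+([.,][0-9SIl]{2})?")

# Assumed corrections for misread characters after the currency sign.
FIXES = str.maketrans({",": ".", "S": "5", "I": "1", "l": "1"})


def normalize(raw: str) -> str:
    """Convert one matched OCR string to its assumed dollar amount."""
    # The first character is the currency sign, whether it was read as "$" or "S".
    return "$" + raw[1:].translate(FIXES)


def find_amounts(text: str) -> list[str]:
    """Find every OCR-style amount in text and return the normalized forms."""
    return [normalize(m.group(0)) for m in OCR_MONEY.finditer(text)]


print(normalize("S0,SI"))
print(find_amounts("Total S400.05 and $0,01"))
```

Handling the first character separately keeps the leading S from being turned into 5 along with the others.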

Upvotes: 4
