Reputation: 10551
I want to parse receipts to find their total sum. The receipts come in as HTML, but they come from different companies, so they have different HTML structures. For example (these are real-life examples):
Company 1: Inside span
<span style="font-size: small;">
Amount: $17.85USD Status: Paid<br>Transaction #: 1<br>
</span>
Company 2: one tr in a table
<tr>
<td style="height:17px; color:black;"><strong>Total</strong></td>
<td style="text-align:right; color:black;"><strong>15.90€</strong></td>
</tr>
Company 3: table inside td from outer table
<td style="border: 0;border-collapse: collapse;margin: 0;padding: 0;-webkit-font-smoothing: antialiased;-moz-osx-font-smoothing: grayscale;width: 472px;">
<table style="border: 0;border-collapse: collapse;margin: 0;padding: 0;width: 100%;">
<tbody>
<tr>
<td style="border: 0;border-collapse: collapse;margin: 0;padding: 0;-webkit-font-smoothing: antialiased;-moz-osx-font-smoothing: grayscale;" valign="top">
<table style="border: 0;border-collapse: collapse;margin: 0;padding: 0;">
<tbody>
<tr>
<td style="border: 0;border-collapse: collapse;margin: 0;padding: 0;-webkit-font-smoothing: antialiased;-moz-osx-font-smoothing: grayscale;font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Ubuntu, sans-serif;mso-line-height-rule: exactly;vertical-align: middle;color: #8898aa;font-size: 12px;line-height: 16px;white-space: nowrap;font-weight: bold;text-transform: uppercase;">
Amount paid
</td>
</tr>
<tr>
<td style="border: 0;border-collapse: collapse;margin: 0;padding: 0;-webkit-font-smoothing: antialiased;-moz-osx-font-smoothing: grayscale;font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, 'Helvetica Neue', Ubuntu, sans-serif;mso-line-height-rule: exactly;vertical-align: middle;color: #525f7f;font-size: 15px;line-height: 24px;white-space: nowrap;">
£4.50
</td>
</tr>
</tbody>
</table>
</td>
How would you design an algorithm that is as future-proof as possible for parsing new receipts from other companies?
Edit: in the HTML, I only show the total sum, but there can be multiple fields with money amounts. I only want to fetch the total sum. Note that it's not always the largest sum on the page, as there might be discounts.
Upvotes: 0
Views: 31
Reputation: 11
Use regular expressions to look for money amounts, and then keep the highest one on the receipt (as it should usually be the total amount).
For example, the regex \$\s*\d+(\.\d{1,2})?
can find $ amounts anywhere in the text.
What language do you use? In PHP, preg_match_all()
is the function you'll need to collect every match :)
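Since the question does not name a language, here is a rough sketch of the same idea in Python, using only the standard library. It strips the tags, finds amounts with a currency symbol before or after the number, prefers the amount that follows a total-like label, and only falls back to the highest amount when no label is found. The label list, the currency symbols, and the names guess_total, find_amounts and TextExtractor are my own assumptions for illustration, not something from the receipts; thousands separators such as 1,234.56 are not handled.

```python
import re
from html.parser import HTMLParser

# Currency symbol before or after the number, e.g. "$17.85", "15.90€", "£4.50".
# Assumption: only these three symbols; thousands separators are not handled.
AMOUNT_RE = re.compile(
    r"(?:[$€£]\s*(\d+(?:[.,]\d{1,2})?)|(\d+(?:[.,]\d{1,2})?)\s*[$€£])"
)

# Hypothetical label list; extend it as new receipt layouts appear.
# The word boundary keeps "Subtotal" from matching "total".
LABEL_RE = re.compile(r"\b(?:total|amount paid|amount)\b", re.IGNORECASE)


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def find_amounts(text):
    """Return a list of (position, value) for every money amount in text."""
    found = []
    for m in AMOUNT_RE.finditer(text):
        raw = m.group(1) or m.group(2)
        found.append((m.start(), float(raw.replace(",", "."))))
    return found


def guess_total(html):
    """Best-effort guess of the receipt total, or None if no amount is found."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)

    amounts = find_amounts(text)
    if not amounts:
        return None

    # Prefer the first amount after the last total-like label that has one.
    label_ends = [m.end() for m in LABEL_RE.finditer(text)]
    for label_end in reversed(label_ends):
        following = [value for pos, value in amounts if pos >= label_end]
        if following:
            return following[0]

    # No usable label: fall back to the highest amount on the receipt.
    return max(value for _, value in amounts)


if __name__ == "__main__":
    receipt = "<tr><td><strong>Total</strong></td><td><strong>15.90€</strong></td></tr>"
    print(guess_total(receipt))
```

The label check is what handles the discount case from the question: on a receipt with a subtotal, a discount and a total, the highest amount is the subtotal, but the amount after the "Total" label is the one returned. New companies will probably need new labels (and other languages), so treat the list as something to grow over time.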
Upvotes: 1