Th0rne

Reputation: 25

Using RegExp to filter receipt for prices

With my program I want to be able to take a picture of any receipt and extract certain pieces of information, one being the price.

My Input is the following:

Input:

----------
BT em
SCHWEINFURT _OSKAR-VON-MILLER-STR.6
RADIESCHEN **0,59**
KAESEAUFSCH. **1.39**
BAUCHSPECK **1,19**
BAUCHSPECK **1,19**
DORNFELDER **0,99**
CLEMENTINEN **2,49**
L&M BLUE **3,50**
L&M BLUE **3,50**
SUMME EUR **14,84** *
BAR **50,00**

RUCKGELD EUR **35,16**
“ENTHALTENE MEHRWERTSTEUER A
MWST NETTO
**7,00** % **0,45** **6,40**
**19,00** % **1,28** **6,71**
SUMME MWST **1,73** **13,11**
EDEKA HANDELSGFSELLSCHAFT
NORDBAYERN-SACHSEN-THURINGEN MBH
STEUERNUMMER: 257/115/30471
QUITTUNG
NUTZEN SIE DIF EDECARD
PUNKTE_SAMMELN+PRAMIEN ERWERBEN
THR EINKAUF WARE UNS
1 BONUSPUNKTE WERT GEWESEN !
08.12.07 16:27 37589 48 4 8500
FS BEDIENTE STE: H. SEUFERT :
VIELEN DANK FÜR IHREN EINKAUF!
AUF WIEDERSEHEN IM E-CENTER
UNSERE ÖFFNUNGSZEITEN FÜR SIE:
MONTAG-SAMSTAG: 0800-20 . 00UER

The information I want to extract is shown in bold.

Tried RegExp:

First I tried the following RegExp:

/(([\d]{1,2})(\,|\.)[\d]{2})/g

I choose this one, because

I got this output

As you can see, part of the date is matched, which I don't want. For now, I don't mind that the values after MWST NETTO also match.

My approach to the problem

My idea was to check for the dot, so I tried adding [^.] before and after my RegExp.

Then I got this output

As you can see, my problem is still there. I also don't understand why 6,40 and 6,71 no longer match, since there is no dot before or after them.

Does anyone have an idea what to try next? I was thinking about an AND condition: I would use my first RegExp and then exclude anything that looks like a date. But I'm not sure how to do that.

I would really appreciate any tips or ideas you have. If anything is unclear or you need more information, please do not hesitate to ask.

Upvotes: 1

Views: 244

Answers (2)

The fourth bird

Reputation: 163362

One way could be to use an alternation to match the format that you don't want and then capture in a group what you do want:

\d+\.\d+\.\d+|(\d{1,2}[.,]\d{2})

Explanation

  • \d+\.\d+\.\d+ Match pattern that you don't want to capture (or for example \d{2}\.\d{2}\.\d{2} if you want to be more specific)
  • | Or
  • (\d{1,2}[.,]\d{2}) Capture in a group 1 or 2 digits, a comma or dot and then 2 digits

Regex demo

const regex = /\d+\.\d+\.\d+|(\d{1,2}[.,]\d{2})/g;
const str = `BT em
SCHWEINFURT _OSKAR-VON-MILLER-STR.6
RADIESCHEN 0,59
KAESEAUFSCH. 1.39
BAUCHSPECK 1,19
BAUCHSPECK 1,19
DORNFELDER 0,99
CLEMENTINEN 2,49
L&M BLUE 3,50
L&M BLUE 3,50
SUMME EUR 14,84 *
BAR 50,00

RUCKGELD EUR 35,16
“ENTHALTENE MEHRWERTSTEUER A
MWST NETTO
7,00 % 0,45 6,40
19,00 % 1,28 6,71
SUMME MWST 1,73 13,11
EDEKA HANDELSGFSELLSCHAFT
NORDBAYERN-SACHSEN-THURINGEN MBH
STEUERNUMMER: 257/115/30471
QUITTUNG
NUTZEN SIE DIF EDECARD
PUNKTE_SAMMELN+PRAMIEN ERWERBEN
THR EINKAUF WARE UNS
1 BONUSPUNKTE WERT GEWESEN !
08.12.07 16:27 37589 48 4 8500
FS BEDIENTE STE: H. SEUFERT :
VIELEN DANK FÜR IHREN EINKAUF!
AUF WIEDERSEHEN IM E-CENTER
UNSERE ÖFFNUNGSZEITEN FÜR SIE:
MONTAG-SAMSTAG: 0800-20 . 00UER`;
let m;

while ((m = regex.exec(str)) !== null) {
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  if (m[1]) {
    console.log(m[1]);
  }
}
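
If `String.prototype.matchAll` is available (ES2020 and later), the same idea can probably be written more compactly. This is only a sketch: the helper name `extractPrices` and the shortened sample string are made up for illustration, and the pattern is the one from above.

```javascript
// Hypothetical helper: returns every price-like token, skipping dotted dates.
function extractPrices(text) {
  // First alternative consumes dates such as 08.12.07 without capturing them;
  // second alternative captures 1-2 digits, a comma or dot, then 2 digits.
  const regex = /\d+\.\d+\.\d+|(\d{1,2}[.,]\d{2})/g;
  return Array.from(text.matchAll(regex))
    .filter((m) => m[1] !== undefined) // keep only matches where group 1 took part
    .map((m) => m[1]);
}

// Shortened sample taken from the receipt text.
const sample = "KAESEAUFSCH. 1.39\nSUMME EUR 14,84 *\n08.12.07 16:27 37589";
console.log(extractPrices(sample)); // [ '1.39', '14,84' ]
```

The date is still matched by the first alternative, but because group 1 stays `undefined` for it, the filter drops it.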

Upvotes: 1

Wiktor Stribiżew

Reputation: 626893

You may use

/(?:^|[^.\d])(\d{1,2}[,.]\d{2})(?![.\d])/g

and grab the contents of Group 1. See the regex demo.

Details

  • (?:^|[^.\d]) - start of string or any char other than . and digit
  • (\d{1,2}[,.]\d{2}) - Group 1: 1 or 2 digits, . or ,, two digits
  • (?![.\d]) - no . or digit immediately to the right is allowed.

JS demo:

var text = "BT em \r\nSCHWEINFURT _OSKAR-VON-MILLER-STR.6 \r\nRADIESCHEN 0,59 \r\nKAESEAUFSCH. 1.39 \r\nBAUCHSPECK 1,19 \r\nBAUCHSPECK 1,19 \r\nDORNFELDER 0,99\r\nCLEMENTINEN 2,49\r\nL&M BLUE 3,50\r\nL&M BLUE 3,50\r\nSUMME EUR 14,84 *\r\nBAR 50,00\r\n\r\nRUCKGELD EUR 35,16\r\n“ENTHALTENE MEHRWERTSTEUER A\r\nMWST NETTO\r\n7,00 % 0,45 6,40\r\n19,00 % 1,28 6,71\r\nSUMME MWST 1,73 13,11\r\nEDEKA HANDELSGFSELLSCHAFT\r\nNORDBAYERN-SACHSEN-THURINGEN MBH\r\nSTEUERNUMMER: 257/115/30471\r\nQUITTUNG\r\nNUTZEN SIE DIF EDECARD\r\nPUNKTE_SAMMELN+PRAMIEN ERWERBEN\r\nTHR EINKAUF WARE UNS\r\n1 BONUSPUNKTE WERT GEWESEN !\r\n08.12.07 16:27 37589 48 4 8500\r\nFS BEDIENTE STE: H. SEUFERT :\r\nVIELEN DANK FÜR IHREN EINKAUF!\r\nAUF WIEDERSEHEN IM E-CENTER\r\nUNSERE ÖFFNUNGSZEITEN FÜR SIE:\r\nMONTAG-SAMSTAG: 0800-20 . 00UER";
var rx = /(?:^|[^.\d])(\d{1,2}[,.]\d{2})(?![.\d])/g;
var m, res = [];
while (m = rx.exec(text)) {
  res.push(m[1]);
}
console.log(res);
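
To illustrate why the right-hand check is a lookahead and not a consuming character class, here is a small sketch on one line of the receipt. The `consuming` variant and the helper `collect` are made up for comparison only; they are not taken from the question. A consumed character is no longer available as the left boundary of the next match, which is most likely why values such as 6,40 went missing once `[^.]` was added on both sides.

```javascript
// Hypothetical helper: collect group 1 of every match of a global regex.
function collect(rx, text) {
  return Array.from(text.matchAll(rx), (m) => m[1]);
}

const line = "7,00 % 0,45 6,40";

// Variant for illustration only: the right-hand check consumes a character.
const consuming = /(?:^|[^.\d])(\d{1,2}[,.]\d{2})(?:[^.\d]|$)/g;
// Pattern from this answer: the right-hand check is a zero-width lookahead.
const lookahead = /(?:^|[^.\d])(\d{1,2}[,.]\d{2})(?![.\d])/g;

console.log(collect(consuming, line)); // [ '7,00', '0,45' ]  (6,40 is lost)
console.log(collect(lookahead, line)); // [ '7,00', '0,45', '6,40' ]
```

In the consuming variant the space after 0,45 is part of that match, so nothing is left to satisfy `[^.\d]` in front of 6,40.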

Upvotes: 1
