Rander
Rander

Reputation: 104

How can I effectively parse text into groups?

I have a set of pdf invoices that need to parsed into respectove columns/groups like below:

enter image description here

csv

To add text,

Data
28/06/22      7694759.0        Service Rear Lift 660L Cardboard                           1      $5.00            $5.00
30/06/22      7710113.0        Service Rear Lift 240L Co-Mingled           Bin Empty      1     $10.25           $10.25
28/06/22      7694756.0        Service Rear Lift 660L Putrescible                         2     $28.50           $57.00
30/06/22      7714583.0        Service Rear Lift 240L Putrescible                         1     $13.91           $13.91
Site: 111947.021 Rental Charge (H8.0 x 1 / RI) GRN (x 5)                              5       $10.00          $50.00
21/06/22     7648522.0   Futile Service Front Lift 1.5M mick-email  1   20  20
28/06/22     7686945.0   Removal Front Lift 4.5M Putrescible    full on removal 2   100 200
03/06/22     7537009.0  Service Rear Lift 240L Putrescible  JENNIFER    1   13.34   1.364

to be parsed into

Date            Site        Description                             Description_2   Qty Price   Total
28/06/22        7694759.0   Service Rear Lift 660L Cardboard                        1   5       5
30/06/22        7710113.0   Service Rear Lift 240L Co-Mingled       Bin Empty       1   10.25   10.25
28/06/22        7694756.0   Service Rear Lift 660L Putrescible                      2   28.5    57
30/06/22        7714583.0   Service Rear Lift 240L Putrescible                      1   13.91   13.91
                111947.021  Rental Charge (H8.0 x 1 / RI) GRN (x 5)                 5   10      50
21/06/22        7648522.0   Futile Service Front Lift 1.5M          mick-email      1   20      20
28/06/22        7686945.0   Removal Front Lift 4.5M Putrescible     full on removal 2   100     200
03/06/22        7537009.0   Service Rear Lift 240L Putrescible      JENNIFER        1   13.34   1.364

How can I perform the grouping in regex?

Upvotes: 1

Views: 81

Answers (1)

lemon
lemon

Reputation: 15492

You can use the following regex:

^([\d\/]+)?\s+(\d+\.\d+)\s+([A-Za-z]\S+(?:\s\S+)*)\s+([A-Za-z]\S+(?:\s\S+)*)?\s*([\d\.]+)\s+([\d\.]+)\s+([\d\.]+)$

Regex Explanation:

  • ^: start of string
  • ([\d\/]+)?: Group 1 (the date) - optional
  • \s+: spaces
  • (\d+\.\d+): Group 2 (the site)
  • \s+: spaces
  • ([A-Za-z]\S+(?:\s\S+)*): Group 3 (the description)
  • \s+: spaces
  • ([A-Za-z]\S+(?:\s\S+)*)?: Group 4 (the description 2) - optional
  • \s*: spaces - optional
  • ([\d\.]+): Group 5 (the quantity)
  • \s: spaces
  • ([\d\.]+): Group 6 (the price)
  • \s: spaces
  • ([\d\.]+): Group 7 (the total)
  • $: end of string

You can access the single groups by their group number.

Check the demo here.

Upvotes: 1

Related Questions