Brian Bruman

Reputation: 913

Match Contents Between Malformed Numbered List (1., 2., 3., etc), Regex

I'm dealing with a messy string produced by a PDF-to-text conversion framework.

I'll post it at the end, but it is probably easier to read here: https://regex101.com/r/DxXupz/1

I figured out how to match the contents between 1. and 2. using this regex:

1\.(.*?)2\.

But as you can see, the $string I'm dealing with contains all sorts of other numbers and decimals, and the list goes all the way up to 11.

Is there a regex solution that captures all the numbered list items in one preg_match_all call, i.e. (example using the regex above for 1. to 2.):

preg_match_all('/1\.(.*?)2\./s', $string, $matches);

To bring back the contents from 1. to 2., 2. to 3., and so forth?

$string = "1. CZ243 96V DC   

20
0pcs  


11.35U
SD            220
.
00
USD


2
”

,74mm/s 


25lbs .

2.

CV243 96V DC  

10
0pcs  


11.35USD            1135
.00
USD  


4
”

,74mm/s


25lbs

3
. CV243 96V DC   

150pcs         12.20
U
SD           1830.00
USD


6
”

,74mm/s   


25lbs .

4. CV243 96V DC  

100
pcs        13.50
1USD            1350.00
USD


8
”

,74mm/s 


25lbs .

5
. CV243 96V DC 

50
pcs    

15.00USD     

750.00
USD


10
”

,74mm/s 


25lbs .

6. CV243 96V DC   

200pcs 

15.00USD    

3000.00
USD


12
”

,74mm/s 


25lbs .

7
. CV243 96V DC  


50pcs 


16.00USD           800.00
USD


14
”

,74mm/s 


25lbs .

8. CV243 96V DC   

75pcs         16.50
USD



1237.50
USD


16
”

,74mm/s 


25lbs .

9. CV243 96V DC               
5
0pcs 


18.46USD           
923.00
USD


18
”

,74mm/s 


25lbs .


10.CV243 96V DC               
50pcs 


18.46USD 

923.00
USD


20
”

,74mm/s 


25lbs .


11. 
CV243 96V DC               
5
0pcs 


20.77USD           1038.50
USD


24
”

,74mm/s 


25lbs .


";

Upvotes: 0

Views: 61

Answers (1)

Nick

Reputation: 147206

This regex should give you the results you want:

\d+\s*\.\s*(CV243 96V DC.*?)(?=\d+\s*\.\s*CV243 96V DC|$)

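It looks for some digits, followed by optional whitespace, a period, more optional whitespace and the string CV243 96V DC. It then grabs all the characters up to the next occurrence of that starting pattern or the end of the string. The end condition is asserted with a positive lookahead, so the start of the next item is not consumed by the current match and is still available for the next one. The s modifier lets . match newlines, which is needed because each item spans several lines. In PHP: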

preg_match_all('/\d+\s*\.\s*(CV243 96V DC.*?)(?=\d+\s*\.\s*CV243 96V DC|$)/s', $string, $matches);
print_r($matches[1]);

The output is somewhat messy, so I won't repeat it all here, but you can see it in operation in this demo. Here are the first two values:

[0] => CV243 96V DC 20 0pcs 11.35U SD 220 . 00 USD 2 ” ,74mm/s 25lbs . 
[1] => CV243 96V DC 10 0pcs 11.35USD 1135 .00 USD 4 ” ,74mm/s 25lbs 
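
For a quick self-contained check, here is a minimal sketch that runs the same pattern on a shortened, made-up sample and collapses runs of whitespace for display. The $sample string (three items, first one already spelled CV243) and the whitespace cleanup are assumptions added for illustration, not part of the original data:

```php
<?php
// Hypothetical shortened sample that mimics the broken markers ("2.\n\n", "3\n. ").
$sample = "1. CV243 96V DC\n\n20\n0pcs\n11.35U\nSD 220\n.\n00\nUSD\n\n"
        . "2.\n\nCV243 96V DC\n\n10\n0pcs 11.35USD\n\n"
        . "3\n. CV243 96V DC\n150pcs 12.20\nU\nSD";

// Same pattern as above: item number, period, then everything up to the
// next item start (or the end of the string).
$pattern = '/\d+\s*\.\s*(CV243 96V DC.*?)(?=\d+\s*\.\s*CV243 96V DC|$)/s';
preg_match_all($pattern, $sample, $matches);

// Collapse each run of whitespace (including newlines) into a single space.
$items = array_map(function ($item) {
    return trim(preg_replace('/\s+/', ' ', $item));
}, $matches[1]);

print_r($items);
```

With this sample, $items should hold three strings, one per list item, each starting with CV243 96V DC. Decimals such as 11.35 do not end an item early, because the lookahead only succeeds when the digits and period are followed by CV243 96V DC.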

Note

I've assumed your data is supposed to start with 1. CV243, not 1. CZ243. If it is supposed to start with 1. CZ243 and you still want to capture that, change the CV243 in the regex to C[VZ]243.
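
As a rough sketch of that variant, assuming a shortened, made-up sample whose first item really is spelled CZ243 (the $sample string is hypothetical; only the character class change comes from the note above):

```php
<?php
// Hypothetical two-item sample: the first item uses CZ243, the second CV243.
$sample = "1. CZ243 96V DC\n20\n0pcs\n\n2.\n\nCV243 96V DC\n10\n0pcs";

// C[VZ]243 accepts either spelling. It is applied here to both the item
// start and the lookahead, so either spelling can also end the previous item.
$pattern = '/\d+\s*\.\s*(C[VZ]243 96V DC.*?)(?=\d+\s*\.\s*C[VZ]243 96V DC|$)/s';
preg_match_all($pattern, $sample, $matches);

print_r($matches[1]);
```

Here $matches[1] should contain two entries, the first beginning with CZ243 and the second with CV243.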

Upvotes: 1
