Andre D

Reputation: 73

Complex Text Parsing - Please Help Figure Out

I'm not strong at algorithm design and have a tricky problem; please take a look. I'm currently working in Java/Groovy.

I've got some text that looks like this:

AAAAA  
AAAAA
CCCCC
any stuff here  
111  
any stuff here  
AAAAA  
stuff  
AAAAA  
stuff  
AAAAA  

BBBBB  
stuff  
222  
stuff  
BBBBB   

My challenge is to extract every block of the form AAAAA stuff 111 stuff AAAAA, without grabbing any surrounding text. As you can see, AAAAA appears several times in the input, but each match must use only the delimiters closest to the 111 or 222 marker, and the same must hold for every block of this type.

One of my regular expressions (not working) looks like this:

/(\w{8}|\w{11}).*?(\w{3}).*?\1/  

I've been playing around with a bunch of them, and they either grab too much text or run too slowly. If anyone has an idea of what I should be using for this type of problem, please let me know.

Edit: These are what I am trying to match:

AAAAA
CCCCC
any stuff here  
111  
any stuff here  
AAAAA  

and

BBBBB  
stuff  
222  
stuff  
BBBBB  

I'd say this is pretty much like parsing improperly tagged XML. Anyway, thanks for looking.

Upvotes: 1

Views: 136

Answers (1)

Ωmega

Reputation: 43703

Use this regex pattern:

(?s)\b(\w{5})\b(?:(?!\1).)*?\b\w{3}\b(?:(?!\1).)*?\1
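
A minimal Java sketch of how this pattern might be applied, assuming the delimiters are exactly five word characters and the inner marker is exactly three, as in the sample above. The class name `BlockFinder` and the method `findBlocks` are illustrative names only, and the sample input is a simplified copy of the question's text without trailing spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BlockFinder {
    // (?s)            let '.' match line terminators too
    // \b(\w{5})\b     capture a five-character delimiter as group 1
    // (?:(?!\1).)*?   consume characters lazily, but never step onto
    //                 a position where the delimiter starts again
    // \b\w{3}\b       a three-character word such as 111 or 222
    // \1              the same delimiter closes the block
    private static final Pattern BLOCK = Pattern.compile(
        "(?s)\\b(\\w{5})\\b(?:(?!\\1).)*?\\b\\w{3}\\b(?:(?!\\1).)*?\\1");

    // Returns every delimited block found in the text, in order of appearance.
    static List<String> findBlocks(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher m = BLOCK.matcher(text);
        while (m.find()) {
            blocks.add(m.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        // Simplified sample input (assumption: no trailing whitespace).
        String sample = String.join("\n",
            "AAAAA", "AAAAA", "CCCCC", "any stuff here", "111",
            "any stuff here", "AAAAA", "stuff", "AAAAA", "stuff", "AAAAA",
            "", "BBBBB", "stuff", "222", "stuff", "BBBBB");
        for (String block : findBlocks(sample)) {
            System.out.println(block);
            System.out.println("----");
        }
    }
}
```

The `(?:(?!\1).)*?` parts advance one character at a time only while the captured delimiter does not start at the current position, so a match can never span a second copy of its own delimiter. That is why the first AAAAA line is skipped and the match starts at the AAAAA nearest to 111. Note that `\b\w{3}\b` accepts any three-character word (in the sample, `any` satisfies it as well as `111`); if the markers are always digits, `\d{3}` would be stricter, but that is an assumption about the data.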

Upvotes: 2
