Nicholas Elliott

Reputation: 339

regex remove all text but string

I have a regex that pulls out the data I am looking for in a block of text:

([A-Z]+A{5,})

This will select the code I am looking for in the following sample text:

Use these licenses with the VMware ESX build.

Feature               License Code                   Description
-------------------   ----------------------------   --------------------------------------------

CIFS                  CAYHXPKBFDUFZGABGAAAAAAAAAAA   CIFS protocol
FCP                   APTLYPKBFDUFZGABGAAAAAAAAAAA   Fibre Channel Protocol 

I want to run a replace on the document so that the result contains only the matched codes, separated by commas:

CAYHXPKBFDUFZGABGAAAAAAAAAAA,APTLYPKBFDUFZGABGAAAAAAAAAAA

Upvotes: 1

Views: 115

Answers (1)

revo

Reputation: 48751

You could add an alternation to your regex like this:

([A-Z]+A{5,})|\X

Then replace it with:

(?1$1,)

The replacement string is a conditional: if the first capturing group took part in the match, the match is replaced with $1 followed by a comma; otherwise it is replaced with nothing.

In the comments I suggested a negative lookahead to avoid adding a comma after a match found at the very end of the text. With the regex as shown above, though, an extra trailing comma is inevitable.
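
For readers working outside an editor, the same idea can be sketched in Python. This is only an illustration under some assumptions: Python's `re` module has neither `\X` nor the `(?1$1,)` conditional replacement, so `[\s\S]` and a callback stand in for them, and the names `SAMPLE`, `PATTERN`, and `keep_codes` are hypothetical. It also shows the trailing comma mentioned above, which is stripped in a separate step.

```python
import re

# Sample text from the question (column spacing shortened, dashed rule omitted).
SAMPLE = """Use these licenses with the VMware ESX build.

Feature   License Code                   Description
CIFS      CAYHXPKBFDUFZGABGAAAAAAAAAAA   CIFS protocol
FCP       APTLYPKBFDUFZGABGAAAAAAAAAAA   Fibre Channel Protocol
"""

# Group 1 is the license code; the second branch consumes any other single character.
# [\s\S] replaces \X here because Python's re does not support \X.
PATTERN = re.compile(r"([A-Z]+A{5,})|[\s\S]")


def keep_codes(text: str) -> str:
    """Hypothetical helper emulating the (?1$1,) conditional replacement."""
    def repl(match: re.Match) -> str:
        # Keep the code plus a comma when group 1 matched; drop everything else.
        return match.group(1) + "," if match.group(1) else ""
    return PATTERN.sub(repl, text)


raw = keep_codes(SAMPLE)    # still ends with a comma
result = raw.rstrip(",")    # remove the trailing comma afterwards
print(result)
```

Stripping the comma afterwards keeps the pattern simple; the lookahead variant from the comments avoids the extra step at the cost of a more complex regex.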


A better approach:

(\b[A-Z]++\b(?<=A{5}))|\X

This uses a possessive quantifier and a lookbehind that checks for the trailing As. You don't need to match A{5,}; checking for A{5} at the end of the word is enough. The word boundaries can be removed if you want to match such strings even in the middle of a longer word.
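
If the text can be processed with a script rather than an editor's replace dialog, extracting the matches and joining them avoids the trailing-comma problem entirely. The following is a minimal sketch, not the answer's original method: it assumes Python, uses a greedy `+` instead of the possessive `++` so it also runs on versions before 3.11, and `CODE` and `extract_codes` are hypothetical names.

```python
import re

# A whole word of uppercase letters whose last five characters are all "A".
# Greedy + replaces the possessive ++ for compatibility with older Python versions.
CODE = re.compile(r"\b[A-Z]+\b(?<=A{5})")


def extract_codes(text: str) -> str:
    """Return every matching code, comma-separated, with no trailing comma."""
    return ",".join(CODE.findall(text))


sample = (
    "CIFS   CAYHXPKBFDUFZGABGAAAAAAAAAAA   CIFS protocol\n"
    "FCP    APTLYPKBFDUFZGABGAAAAAAAAAAA   Fibre Channel Protocol\n"
)
print(extract_codes(sample))
```

Because `str.join` only places commas between items, no cleanup step is needed after the extraction.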

Upvotes: 3
