Martin

Reputation: 22760

Regex Select groups not found in a pattern

I have been looking at the various regex topics on SO, and they all say that to find the inverse (select everything that doesn't fit the criteria) you simply use the [^] syntax or a negative lookahead.

I have tried both of these methods on my regex, but the results are not adequate. The [^] syntax in particular seems to treat all its contents literally (even when escaped).

What I need this for:

I have a massive single-line SQL dump, and I'm trying to remove all characters that are not the line id or the numerical value of one column.

My regex works in matching exactly what I'm looking for; what I need to do is to invert this match so I can remove all non-matching parts in my IDE.

My regex:

/(\),\(\d{1,4},)|(,\d{10},)/

This matches a "),(<number up to 4 digits>," or ",<number of ten digits>,".

The subject

My subject is a 500 KB line of an SQL dump that looks something like this (I have already removed a-z and other unwanted characters in previous simple find/replaces):

),(39,' ',1,'01761472100','@','9    ','20',1237213277,0,1237215419,''),(40,' ',3,'01445731203','@',' ','-','22 2','210410//816',1237225423,0,1484651768,''),(4270,' / 

My aim is to use a regex to achieve the following output:

),(39,,1237213277,,1237215419,),(40,,1237225423,,1484651768,),(4270,

Which I can then go over again and easily remove repetitions such as commas.


I have read that negation in regex is tricky. So, what is the syntax to invert the regex I've made, so that all non-matching parts are removed? What can you recommend as a way of solving this without spending hours manually reading the lines?

Upvotes: 2

Views: 68

Answers (1)

Wiktor Stribiżew

Reputation: 626689

You can use the very helpful (*SKIP)(?!) construct in PCRE (equivalent to (*SKIP)(*F) or (*SKIP)(*FAIL)) to match the text you want to keep, skip it, and then match all the remaining text for removal:

/(?:\),\(\d{1,4},|,\d{10},)(*SKIP)(?!)|./s

See the regex demo

Details:

  • (?:\),\(\d{1,4},|,\d{10},) - match 1 of the 2 alternatives:
    • \),\(\d{1,4}, - ),(, then 1 to 4 digits and then ,
    • | - or
    • ,\d{10}, - a comma, 10 digits, a comma
  • (*SKIP)(?!) - discard the text matched so far and resume searching for the next match right after it
  • | - or
  • . - any char (since /s DOTALL modifier is passed to the regex)
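
For engines without PCRE verbs, the net effect of the pattern above (only the wanted tokens survive) can be approximated by collecting the matches and joining them. The following is a minimal sketch in Python, assuming the standard `re` module, which does not support (*SKIP); the names `KEEP` and `keep_only` are hypothetical and not part of the original answer.

```python
import re

# The two alternatives to keep, copied from the pattern above.
KEEP = re.compile(r"\),\(\d{1,4},|,\d{10},")


def keep_only(text: str) -> str:
    """Return only the parts of text that match KEEP, in order."""
    # findall returns whole matches here because the pattern has no groups.
    return "".join(KEEP.findall(text))


# Sample line taken from the question.
sample = (
    "),(39,' ',1,'01761472100','@','9    ','20',1237213277,0,1237215419,''),"
    "(40,' ',3,'01445731203','@',' ','-','22 2','210410//816',"
    "1237225423,0,1484651768,''),(4270,' / "
)

print(keep_only(sample))
```

On the sample from the question this should print the output the question asks for. It is a different technique from (*SKIP)(?!), since it builds a new string from the matches rather than deleting the non-matches in place.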

A similar result can be achieved without PCRE verbs, using

/(\),\(\d{1,4},|,\d{10},)?./s

and replacing with the $1 backreference (since we need to put back the text captured by the patterns we want to keep); see another regex demo.
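
As an illustration of the capture-and-put-back approach, here is a minimal sketch in Python, assuming Python 3.5 or later, where an unmatched group in the replacement expands to an empty string. In Python's `re` the backreference is written `\1` rather than `$1`; the names `KEEP_THEN_ANY` and `strip_non_matching` are hypothetical.

```python
import re

# Optional captured token to keep, followed by one character to drop.
# re.DOTALL plays the role of the /s modifier so that . also matches newlines.
KEEP_THEN_ANY = re.compile(r"(\),\(\d{1,4},|,\d{10},)?.", re.DOTALL)


def strip_non_matching(text: str) -> str:
    """Replace each match with group 1, removing everything else."""
    return KEEP_THEN_ANY.sub(r"\1", text)


# Sample line taken from the question.
sample = (
    "),(39,' ',1,'01761472100','@','9    ','20',1237213277,0,1237215419,''),"
    "(40,' ',3,'01445731203','@',' ','-','22 2','210410//816',"
    "1237225423,0,1484651768,''),(4270,' / "
)

print(strip_non_matching(sample))
```

Note that this pattern always consumes one character after a kept token, so a token at the very end of the input, or one immediately followed by another token, may not be preserved. On the sample from the question neither case occurs.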

Upvotes: 2
