Dherik

Reputation: 19100

Create a regex from blacklist and whitelist regular expressions to identify and remove URL parameters

I would like to identify and remove some parameters from a URL using a blacklist and a whitelist. However, I want both lists to be regular expressions rather than lists of words. Every parameter that matches the blacklist regex should be removed, unless the whitelist regex also matches it.

The regex will be passed to Java's String.replaceAll method. I have almost found a solution, but I am having trouble making it work in the general case.

For example, if I have the list configured with the regular expressions:

The objective: remove param2 and not param1, because param1 is in the whitelist regex.

I created an expression that puts the whitelist in a negative lookahead:

(?!(param1))(param1|param2)

I then combined this expression with another regular expression that identifies the URL query string delimiters:

(?<=[?&;])(?!(param1))(param1|param2)=.*?($|[&;])

The result is that only param2 is matched:

https://www.so.com?param2=2&param1=1
https://www.so.com?param1=1
https://www.so.com?param1=1&param2=2
https://www.so.com?param3=3&param1=1&param2=2
https://www.so.com?param3=3&param2=2&param1=1

The Java code is something like:

url.replaceAll("(?<=[?&;])" + asNegativeLookahead(whitelist, blacklist) + "=.*?($|[&;])", "")
   .replaceAll("[?&;]$", "");

So far, so good.
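
For reference, the replacement above can be assembled as in the following sketch. The body of asNegativeLookahead is not shown in the question, so the version here is an assumption that simply reproduces the (?!(param1))(param1|param2) shape; the class name QueryParamStripper and the strip method are hypothetical.

```java
final class QueryParamStripper {
    // Assumed implementation: whitelist inside a negative lookahead,
    // followed by the blacklist as an ordinary group.
    static String asNegativeLookahead(String whitelist, String blacklist) {
        return "(?!(" + whitelist + "))(" + blacklist + ")";
    }

    // Same two replaceAll calls as in the question.
    static String strip(String url, String whitelist, String blacklist) {
        return url
            .replaceAll("(?<=[?&;])" + asNegativeLookahead(whitelist, blacklist) + "=.*?($|[&;])", "")
            // Drop a delimiter left dangling at the end of the URL.
            .replaceAll("[?&;]$", "");
    }

    public static void main(String[] args) {
        // prints https://www.so.com?param1=1
        System.out.println(strip("https://www.so.com?param1=1&param2=2", "param1", "param1|param2"));
    }
}
```

With whitelist param1 and blacklist param1|param2, each of the five URLs listed above should come out with param2 removed and param1 kept.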

But the problem appears when I use a more general regular expression in the blacklist, like .*:

This matches everything after param1 when the first parameter is param1, ignoring the whitelist regex.
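
The exact expression that failed is not shown here, so the following is only one way the problem can be reproduced, assuming the blacklist .* is dropped verbatim into the same template. The whitelist lookahead is checked once, at the start of the match; a greedy .* can then run across the & delimiters and swallow later parameters, whitelisted or not. The class name GreedyBlacklistDemo is hypothetical.

```java
final class GreedyBlacklistDemo {
    // Same template as in the question, with the blacklist inserted as-is.
    static String strip(String url, String whitelist, String blacklist) {
        String regex = "(?<=[?&;])(?!(" + whitelist + "))(" + blacklist + ")=.*?($|[&;])";
        return url.replaceAll(regex, "").replaceAll("[?&;]$", "");
    }

    public static void main(String[] args) {
        // The match starts at param2 (not whitelisted), but .* extends up to
        // the last '=', so param1 is removed as well.
        // prints https://www.so.com
        System.out.println(strip("https://www.so.com?param2=2&param1=1", "param1", ".*"));
    }
}
```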

I found a solution that identifies each parameter with another regular expression and matches each group against the whitelist and the blacklist. However, I am not confident about that code: I have to rebuild the URL from the parameters manually and still need the negative lookahead, so it does not really simplify the solution.
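
The per-parameter code itself is not included in the question. A minimal sketch of that general idea, under the assumption that the whitelist and blacklist are meant to match whole parameter names, might look like the following. It uses Matcher.matches() on each name instead of a lookahead, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class PerParamFilter {
    static String strip(String url, String whitelist, String blacklist) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query string, nothing to remove
        }
        Pattern allow = Pattern.compile(whitelist);
        Pattern deny = Pattern.compile(blacklist);
        List<String> kept = new ArrayList<>();
        for (String pair : url.substring(q + 1).split("[&;]")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            // Remove only when the blacklist matches and the whitelist does not.
            boolean remove = deny.matcher(name).matches() && !allow.matcher(name).matches();
            if (!remove) {
                kept.add(pair);
            }
        }
        String base = url.substring(0, q);
        // The URL is rebuilt by hand; separators are normalised to '&'.
        return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
    }
}
```

This sketch ignores URL fragments and rewrites any ; separator as &, which illustrates the manual-rebuild cost mentioned above.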

Upvotes: 1

Views: 1465

Answers (1)

Ωmega

Reputation: 43683

I suggest using this combined pattern:

([?&](?!.*&)|(?<=[?&;]))(?!(param1))(?=(param1|param2))([^&;=\n\r]*)=.*?($|[&;])

                whitelist ◄└──────┘    └─────────────┘► blacklist

See this demo.
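
A sketch of how the pattern above might be assembled and applied in Java, assuming the whitelist and blacklist arrive as regex strings. The class name CombinedPattern and its methods are hypothetical, and the trailing-delimiter cleanup is carried over from the question rather than from this answer.

```java
final class CombinedPattern {
    // Whitelist goes into a negative lookahead, blacklist into a positive
    // lookahead; the name itself is consumed by [^&;=\n\r]*, which cannot
    // cross a delimiter.
    static String combine(String whitelist, String blacklist) {
        return "([?&](?!.*&)|(?<=[?&;]))"
             + "(?!(" + whitelist + "))"
             + "(?=(" + blacklist + "))"
             + "([^&;=\\n\\r]*)=.*?($|[&;])";
    }

    static String strip(String url, String whitelist, String blacklist) {
        return url
            .replaceAll(combine(whitelist, blacklist), "")
            // Kept from the question: two adjacent removals can still
            // leave a dangling delimiter at the end.
            .replaceAll("[?&;]$", "");
    }

    public static void main(String[] args) {
        // prints https://www.so.com?param1=1
        System.out.println(strip("https://www.so.com?param2=2&param1=1", "param1", ".*"));
    }
}
```

Because the parameter name is consumed by a character class rather than by the blacklist itself, a broad blacklist such as .* no longer swallows the whitelisted param1 in the cases tried here.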

Upvotes: 2
