sparkonhdfs

Reputation: 1343

Use Regular Expressions to find URLs without certain word patterns

I am trying to write a regular expression that matches only URLs that do not contain a certain pattern. The URLs I want to exclude contain an ID made of 40 uppercase hexadecimal characters.

For example, given the following URLs:

/dev/api/appid/A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5/users

/dev/api/apps/list

/dev/api/help/apps/applicationname/apple/osversion/list/

(The URLs are made up, but the idea is that some endpoints contain 40-character IDs, some do not, and some are simply very long overall.)

I want the regular expression to match only the last two URLs, not the first one.

I wrote the following regex:

\S+(?:[0-9A-F]{40})\S+

It matches the endpoints that do have the long ID in them and skips the ones without it, which is the opposite of what I need. If I try to negate the regex,

\S+(?![0-9A-F]{40})\S+

it matches every endpoint, because some URLs are longer than the 40 characters an ID should have.

How can I use a regular expression to filter out exactly the URLs I need?

Upvotes: 0

Views: 630

Answers (2)

Gurmanjot Singh

Reputation: 10360

Try this regex:

^(?!.*\/[0-9A-F]{40}\/).*$


Explanation:

  • ^ - asserts the start of the string/url
  • (?!.*\/[0-9A-F]{40}\/) - a negative lookahead that checks for a / followed by exactly 40 uppercase hex characters followed by another / anywhere in the string. Since it is a negative lookahead, any string/URL containing this pattern will not be matched.
  • .* - matches 0+ occurrences of any character except a newline character
  • $ - asserts the end of the string
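
As a minimal sketch of how this pattern could be applied, here it is in Python against the three URLs from the question. The choice of Python and the names NO_ID, urls and kept are assumptions for illustration only. The slashes are written unescaped because Python does not require \/.

```python
import re

# The pattern from this answer, with the slashes left unescaped.
NO_ID = re.compile(r"^(?!.*/[0-9A-F]{40}/).*$")

# Sample URLs taken from the question.
urls = [
    "/dev/api/appid/A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5/users",
    "/dev/api/apps/list",
    "/dev/api/help/apps/applicationname/apple/osversion/list/",
]

# Keep only the URLs the pattern matches, i.e. those without a /<40 hex chars>/ segment.
kept = [u for u in urls if NO_ID.match(u)]

for u in kept:
    print(u)
```

Note that the lookahead requires a / on both sides of the ID, so a URL that ends with the ID and has no trailing slash would still be matched. If that case can occur in your data, the pattern would need adjusting.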

Upvotes: 1

Callum Watkins

Reputation: 2991

^((?![A-F0-9]{40}).)*$

This applies a negative lookahead before every character, so it matches any line that doesn't contain 40 uppercase hex digits in a row.
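
A rough sketch of this pattern in Python, using the URLs from the question. The function names has_no_id and has_no_id_simple are hypothetical. The second function is an alternative that is not part of the regex above: for single-line strings, searching for the ID and negating the result in code should give the same answer.

```python
import re

# The pattern from this answer: the lookahead is checked before each character.
TEMPERED = re.compile(r"^((?![A-F0-9]{40}).)*$")

# Alternative (an assumption, not from the answer): look for the ID directly.
HEX_RUN = re.compile(r"[A-F0-9]{40}")


def has_no_id(url):
    """True if the whole line matches, i.e. no run of 40 hex digits starts anywhere."""
    return TEMPERED.match(url) is not None


def has_no_id_simple(url):
    """Same check for single-line input, done by negating a plain search."""
    return HEX_RUN.search(url) is None


urls = [
    "/dev/api/appid/A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5A1B2C3D4E5/users",
    "/dev/api/apps/list",
    "/dev/api/help/apps/applicationname/apple/osversion/list/",
]

for u in urls:
    print(has_no_id(u), has_no_id_simple(u), u)
```

Unlike a pattern that anchors the ID between slashes, this one rejects any run of 40 or more uppercase hex digits, wherever it appears in the line.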

Upvotes: 1
