Kirill

Reputation: 3757

How to exclude Regex matches from inside urls with JavaScript

My current regex looks for a searchQuery inside sentences and matches it if the query is preceded by a blank space and followed by either a blank space or one of ? ! , . It generally works well, except for URLs: the regex ends up matching inside URLs and messing them up.

For example, if I were looking for "bitcoin" in the sentence "Bitcoin price is going nuts", it would find it, but it would also match inside the following URL, https://versionone.vc/the-solar-bitcoin-convergence, messing up the URL.

How can I tell a JavaScript regex to ignore any match where the character before the matching word is one of / - . _ + ? This would essentially eliminate matches inside URLs.

Current Regex: var reg = new RegExp('(\\b)${searchQuery}(\\s+|\\.|\\,|\\?|\\!', 'gi');

Replacement function: newString = oldString.replace(reg, substringReplacement);

substringReplacement(match) is a function that contains the logic of how to change the matching text.

Alternatively, what's another way to exclude URLs from the searchable area outright? Thanks!

Upvotes: 0

Views: 1225

Answers (3)

Kirill

Reputation: 3757

Although the other answers are more correct as far as regex is concerned, negative lookbehind isn't supported by Safari, so for now I have come up with a workaround. Instead of looking behind and trying to negate the string, I can look ahead and reject matches that are most likely part of a URL.

${searchQuery}(?!-|\/|\.com) will skip a big fraction of urls, unless the searchQuery word is the last word in the url.
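A minimal sketch of this lookahead workaround, to show both the case it handles and the case it misses. The helper names escapeRegExp and highlight are hypothetical, the leading \b is borrowed from the question's regex, and wrapping matches in square brackets is only a stand-in for the real replacement logic.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: wraps each match in [ ] so the effect is visible.
function highlight(text, searchQuery) {
  // Reject matches that are followed by "-", "/" or ".com".
  const reg = new RegExp(`\\b${escapeRegExp(searchQuery)}(?!-|\\/|\\.com)`, 'gi');
  return text.replace(reg, (match) => `[${match}]`);
}

const sample = 'Bitcoin price is going nuts https://versionone.vc/the-solar-bitcoin-convergence';
console.log(highlight(sample, 'bitcoin'));
// [Bitcoin] price is going nuts https://versionone.vc/the-solar-bitcoin-convergence

// Known gap: the query is the last word of the URL, so it is still matched.
console.log(highlight('see https://example.com/solar-bitcoin', 'bitcoin'));
// see https://example.com/solar-[bitcoin]
```

The second call shows the limitation mentioned above: with nothing after the word, the lookahead has nothing to reject.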

When I find the perfect answer, I will post it here.

Upvotes: 0

anubhava

Reputation: 786091

Modern JavaScript supports variable-length lookbehind assertions, so you may try (note the backticks, which are needed for ${searchQuery} to be interpolated):

var reg = new RegExp(`(?<!https?:\\/\\/\\S*)\\b${searchQuery}[\\s.,?!]`, 'gi');

(?<!https?:\/\/\S*) is a negative lookbehind that fails the match if http:// or https:// followed by 0 or more non-whitespace characters is found immediately before the match.
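A small sketch of how the lookbehind behaves on sample text, assuming an engine that supports lookbehind (for example a current V8/Node.js). The helpers escapeRegExp and highlightOutsideUrls are hypothetical, and the two capturing groups are added here only so the trailing character can be put back after the bracketed word.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: brackets the query wherever it is not part of a URL.
function highlightOutsideUrls(text, searchQuery) {
  const reg = new RegExp(
    `(?<!https?:\\/\\/\\S*)\\b(${escapeRegExp(searchQuery)})([\\s.,?!])`,
    'gi'
  );
  // word = the query as it appears in the text, tail = the character after it.
  return text.replace(reg, (match, word, tail) => `[${word}]${tail}`);
}

const text = 'Bitcoin is up, see https://example.com/solar-bitcoin for bitcoin news.';
console.log(highlightOutsideUrls(text, 'bitcoin'));
// [Bitcoin] is up, see https://example.com/solar-bitcoin for [bitcoin] news.
```

The occurrence inside the URL is skipped even though a space follows it, because an unbroken run of non-whitespace characters leads back to https://.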

Upvotes: 2

CertainPerformance

Reputation: 371168

I'd match the format of a URL or match the searchQuery pattern, then use a replacer function to check if the URL or the searchQuery was matched. In the case of the URL, replace with the URL (so that nothing gets replaced in such a case).

You'll also need to use backticks for a template literal if you want to use ${}-style interpolation.
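To illustrate the difference, here is a short sketch comparing the two quote styles; the variable names are only for demonstration.

```javascript
const searchQuery = 'bitcoin';

// Single quotes: ${searchQuery} stays as literal text in the pattern source.
const singleQuoted = '\\b${searchQuery}\\b';

// Backticks: ${searchQuery} is replaced by the variable's value.
const templateLiteral = `\\b${searchQuery}\\b`;

console.log(singleQuoted);    // \b${searchQuery}\b
console.log(templateLiteral); // \bbitcoin\b
```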

// make this as elaborate as you want:
// https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url
var reg = new RegExp(`(https?:\\/\\/\\S+)|(\\b)${searchQuery}(\\s+|\\.|\\,|\\?|\\!)`, 'gi');
newString = oldString.replace(reg, (match, g1) => g1 ? match : substringReplacement(match));

You also need to make sure the () groups are balanced (in your current code, they aren't, so the new RegExp call will currently throw a SyntaxError)
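A quick way to see the error is to try compiling both versions of the pattern. The helper tryCompile is hypothetical, and "bitcoin" stands in for the interpolated searchQuery.

```javascript
// Hypothetical helper: reports whether a pattern source compiles.
function tryCompile(source) {
  try {
    new RegExp(source, 'gi');
    return 'ok';
  } catch (err) {
    return err.name;
  }
}

// Pattern from the question: the last group is never closed.
console.log(tryCompile('(\\b)bitcoin(\\s+|\\.|\\,|\\?|\\!'));  // SyntaxError

// Same pattern with the closing parenthesis added.
console.log(tryCompile('(\\b)bitcoin(\\s+|\\.|\\,|\\?|\\!)')); // ok
```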

The substringReplacement isn't shown, but unless you're using the groups to replace, you can probably omit the capturing groups entirely, except for the URL section.
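Putting the pieces together, a runnable sketch of the match-URL-or-query approach might look like the following. Since substringReplacement isn't shown in the question, the version here (uppercasing the match) is purely a stand-in; escapeRegExp and replaceOutsideUrls are likewise hypothetical names, and only the URL part is captured.

```javascript
// Hypothetical helper: escape regex metacharacters in user-supplied text.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Stand-in for the question's substringReplacement, which is not shown.
function substringReplacement(match) {
  return match.toUpperCase();
}

function replaceOutsideUrls(oldString, searchQuery) {
  // Alternative 1 captures a whole URL; alternative 2 matches the query
  // followed by whitespace or one of . , ? !
  const reg = new RegExp(
    `(https?:\\/\\/\\S+)|\\b${escapeRegExp(searchQuery)}(?:\\s+|[.,?!])`,
    'gi'
  );
  // If the URL group matched, return the match unchanged.
  return oldString.replace(reg, (match, url) =>
    url ? match : substringReplacement(match)
  );
}

const oldString =
  'Bitcoin price is going nuts, see https://versionone.vc/the-solar-bitcoin-convergence and buy bitcoin.';
console.log(replaceOutsideUrls(oldString, 'bitcoin'));
// BITCOIN price is going nuts, see https://versionone.vc/the-solar-bitcoin-convergence and buy BITCOIN.
```

Because the URL alternative consumes the whole URL in one match, the query alternative never gets a chance to match inside it, so no lookbehind support is required.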

Upvotes: 1
