ceckenrode

Reputation: 4703

Regex to match local markdown links

I'm trying to create a regex that matches Markdown links while ignoring the content that comes before and after them. It should match only local links, which point to local files, and ignore ones that point to external websites. Example:

"dddd [link which should be ignore](http://google.com/) lorem ipsum lorem ips sum loreerm [link which shouldn't be ignored](../../../filepath/folder/some-other-folder/another-folder/one-last-folder/file-example.html). lorem ipsum lorem"

It should match only the second link, but currently it matches everything from the first link through the second. My regex otherwise works for what I need; this is the main edge case I've found.

What I have so far:

/(!?\[.*?\]\((?!.*?http)(?!.*?www\.)(?!.*?#)(?!.*?\.com)(?!.*?\.net)(?!.*?\.info)(?!.*?\.org).*?\))/g

Currently, this ignores the first link and matches the second link IF the second link doesn't come after the first link. Otherwise, it matches everything from the first to the second.

I'm using JavaScript, which doesn't support negative lookbehinds. Any suggestions?

Upvotes: 2

Views: 2732

Answers (2)

Casimir et Hippolyte

Reputation: 89614

Testing whether a URL is local or external is not a job for a regex. As the third link in the example string below shows, checking whether the URI contains .org, .com, http, # or similar is unreliable, because a local path can contain any of them.

This code shows how to tell whether a URL is local in a replacement context on the client side:

var text = '[external link](http://adomain.com/path/file.txt) ' +
           '[local link](../path/page.html) ' +
           '[local link](../path.org/http/file.com.php#fragment)';

// g1 is the link text, g2 is the link target.
text = text.replace(/\[([^\]]*)\]\(([^)]*)\)/g, function (_, g1, g2) {
    // Let the browser resolve the target against the current page.
    var myurl = document.createElement('a');
    myurl.href = g2;
    // Same hostname as the current page means the link is local.
    return window.location.hostname === myurl.hostname ? "locrep" : "extrep";
});

console.log(text);
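The same idea can be tried outside a browser. The following is only a sketch, assuming the standard URL constructor is available (as in Node.js or a modern browser) and using a made-up base address, BASE, in place of window.location. The names isLocalLink and localMarkdownLinks are hypothetical helpers, not part of any library.

```javascript
// Hypothetical base URL standing in for window.location (an assumption).
const BASE = 'https://example.org/docs/guide/page.html';

// True when href resolves to the same hostname as the base URL.
function isLocalLink(href, base = BASE) {
  try {
    // Relative targets resolve against base; absolute ones keep their own host.
    return new URL(href, base).hostname === new URL(base).hostname;
  } catch (e) {
    return false; // unparsable target: treat it as not local
  }
}

// Collect every Markdown link or image whose target is local.
function localMarkdownLinks(text, base = BASE) {
  const linkRe = /!?\[([^\]]*)\]\(([^)]*)\)/g;
  const found = [];
  for (const m of text.matchAll(linkRe)) {
    if (isLocalLink(m[2], base)) found.push(m[0]);
  }
  return found;
}

console.log(localMarkdownLinks(
  '[external link](http://adomain.com/path/file.txt) ' +
  '[local link](../path/page.html) ' +
  '[local link](../path.org/http/file.com.php#fragment)'
));
```

Here the regex only finds the link syntax, and the local-or-external decision is left to URL resolution, so the third link is kept even though its path contains .org, http and .com.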

Upvotes: 0

user557597


There are two problems.

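  1. \[.*?\] can run past the first ] and match [link which should be ignore](http://google.com/) lorem ipsum lorem ips sum loreerm [link which shouldn't be ignored] as a single bracket group, just so that the assertions succeed.
  2. The assertions are unbounded: the .*? inside each lookahead scans to the end of the line rather than stopping at the closing ) of the current link.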
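Both problems can be reproduced with the regex from the question. This is a small sketch on two made-up sample strings (externalFirst and localFirst are illustrative names, not from the question):

```javascript
// The regex from the question, unchanged.
const original = /(!?\[.*?\]\((?!.*?http)(?!.*?www\.)(?!.*?#)(?!.*?\.com)(?!.*?\.net)(?!.*?\.info)(?!.*?\.org).*?\))/g;

// Made-up samples: the same two links in both orders.
const externalFirst = 'a [ext](http://google.com/) b [loc](../dir/file.html) c';
const localFirst = 'a [loc](../dir/file.html) b [ext](http://google.com/) c';

// Problem 1: \[.*?\] stretches from the first [ to the second ],
// so a single match swallows both links.
const overMatch = externalFirst.match(original);
console.log(overMatch);

// Problem 2: the lookaheads see "http" later on the line and reject
// the local link as well, so nothing matches at all.
const missed = localFirst.match(original);
console.log(missed);
```

The first call logs one match that spans both links, and the second logs null.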

You can fix both with this regex:

((!?\[[^\]]*?\])\((?:(?!http|www\.|\#|\.com|\.net|\.info|\.org).)*?\))

Expanded

 (                             # (1 start)
      ( !?\[ [^\]]*? \] )           # (2), Link
      \(                            # Open paren (
      (?:                           # Cluster
           (?!                           # Not any of these
                http
             |  www\.
             |  \# 
             |  \.com 
             |  \.net 
             |  \.info 
             |  \.org 
           )
           .                             # Ok, grab this character 
      )*?                           # End cluster, do 0 to many times
      \)                            # Close paren )
 )                             # (1 end)

Metrics

----------------------------------
 * Format Metrics
----------------------------------
Cluster Groups      =   1

Capture Groups      =   2

Assertions          =   1
       ( ? !        =   1

Free Comments       =   7
Character Classes   =   1
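As a usage sketch, the regex above can be applied in JavaScript with the g flag against the example string from the question. The variable names are illustrative, and the \# escape is written as a plain #, which should be equivalent here:

```javascript
// The fixed regex, with the g flag added so every link is tried.
const localLinkRe = /((!?\[[^\]]*?\])\((?:(?!http|www\.|#|\.com|\.net|\.info|\.org).)*?\))/g;

// The example string from the question.
const example = "dddd [link which should be ignore](http://google.com/) lorem ipsum lorem ips sum loreerm [link which shouldn't be ignored](../../../filepath/folder/some-other-folder/another-folder/one-last-folder/file-example.html). lorem ipsum lorem";

// Group 1 is the whole link; group 2 is the [text] part.
const matches = [...example.matchAll(localLinkRe)];
for (const m of matches) {
  console.log(m[1]);
}
```

Only the second link is logged, because [^\]] cannot cross a closing bracket and the lookahead is checked one character at a time inside the parentheses. The filter is still keyword-based, so a local path that contains one of the listed tokens (such as .org) is rejected as well.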

Upvotes: 2
