Brandon Minnick

Reputation: 15340

Parse URLs using Regex, Ignoring Code Blocks and Code Snippets in Markdown

I am currently using this regular expression in my C# / .NET Core app to parse HTTP, HTTPS and FTP URLs from a Markdown file:

static readonly Regex _urlRegex = new Regex(@"(((http|ftp|https):\/\/)+[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)");

void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
    var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias));

    //handle updated markdown
}

static string HandleRegex(in string url, in string repositoryName, in string channel, in string alias)
{
    //handle url
}

I would like to update this regex so that it ignores URLs inside Markdown code blocks and Markdown code snippets.

Example 1

The following URL should be ignored because it is inside of a code block:

```
{ "name": "Brandon", "blog" : "https://codetraveler.io" }
```

Example 2

The following URL should be ignored because it is inside of a code snippet:

`curl -I https://www.keycdn.com`

Upvotes: 3

Views: 376

Answers (1)

Wiktor Stribiżew

Reputation: 626927

You can leverage your existing code that already has a match evaluator as the replacement argument in Regex.Replace.

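Add an alternative (using the | alternation operator) to the current regex that matches the contexts where you want to ignore URLs, and wrap the URL pattern in its own capturing group. In the match evaluator, check whether the URL group matched: if it did, handle the URL; otherwise, return the match unchanged.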

The alternative you should add is (?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1. It matches:

  • (?<!`) - no backtick immediately to the left is allowed
  • (`(?:`{2})?) - Group 1: a backtick and then an optional double backtick sequence (that is, either one or three backticks)
  • (?:(?!\1).)*? - any char (including line break chars, since the sample code below uses RegexOptions.Singleline), zero or more occurrences but as few as possible, that does not start the same char sequence that is captured in Group 1
  • \1 - the same char sequence that is captured in Group 1
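
To see what this alternative matches on its own, here is a minimal sketch, assuming RegexOptions.Singleline as in the answer's sample code. The class name CodeSpanDemo, the FindCodeSpans helper and the sample string are hypothetical and only serve the illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CodeSpanDemo
{
    // Only the code-skipping alternative, without the URL part.
    // Singleline lets '.' match line breaks, so a fenced block can span several lines.
    static readonly Regex CodeSpan = new Regex(
        @"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1",
        RegexOptions.Singleline);

    // Hypothetical helper: returns every inline snippet or fenced block found in the text.
    public static string[] FindCodeSpans(string markdown) =>
        CodeSpan.Matches(markdown).Cast<Match>().Select(m => m.Value).ToArray();

    public static void Main()
    {
        var sample =
            "Run `curl -I https://www.keycdn.com` or see:\n" +
            "```\n{ \"blog\": \"https://codetraveler.io\" }\n```\n";

        // Each match is a whole code span, delimiters included.
        foreach (var span in FindCodeSpans(sample))
            Console.WriteLine(span);
    }
}
```

For this sample, the inline snippet and the fenced block should each come back as a single match, which is what allows the combined regex to step over the URLs inside them.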

See the sample code:

static readonly Regex _urlRegex = new Regex(@"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)", RegexOptions.Singleline);

void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
    var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => x.Groups[2].Success ?
         HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias) : x.Value);

    //handle updated markdown
}

I modified the URL pattern a bit to make it cleaner and more efficient.
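
As a quick way to check the behavior, here is a minimal self-contained sketch built around the same regex. The class name UrlRewriteDemo, the Rewrite method and the MarkUrl helper are hypothetical stand-ins (MarkUrl takes the place of HandleRegex and simply wraps the URL in square brackets so the rewrites are visible).

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlRewriteDemo
{
    // Group 1: opening backtick(s) of a code span; Group 2: a URL outside any code span.
    static readonly Regex UrlRegex = new Regex(
        @"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)",
        RegexOptions.Singleline);

    // Hypothetical stand-in for HandleRegex: marks the URL so the change is easy to spot.
    static string MarkUrl(string url) => "[" + url + "]";

    // Rewrites URLs only when Group 2 took part in the match;
    // code spans (the first alternative) are returned unchanged.
    public static string Rewrite(string markdown) =>
        UrlRegex.Replace(markdown, m => m.Groups[2].Success ? MarkUrl(m.Groups[2].Value) : m.Value);

    public static void Main()
    {
        var markdown = "Docs: https://example.com/docs and `curl -I https://www.keycdn.com`";
        Console.WriteLine(Rewrite(markdown));
    }
}
```

With this input, only the URL outside the backticks should end up wrapped in brackets; the curl snippet is consumed by the first alternative and passed through as-is.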

Upvotes: 2
