Reputation: 15340
I am currently using this regular expression in my C# / .NET Core app to parse HTTP, HTTPS, and FTP URLs from a Markdown file:
static readonly Regex _urlRegex = new Regex(@"(((http|ftp|https):\/\/)+[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)");
void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias));
//handle updated markdown
}
static string HandleRegex(in string url, in string repositoryName, in string channel, in string alias)
{
//handle url
}
I am looking to update this regex to ignore URLs inside of markdown code blocks and markdown code snippets.
The following URL should be ignored because it is inside of a code block:
` ` `
{
"name": "Brandon",
"blog" : "https://codetraveler.io"
}
` ` `
The following URL should be ignored because it is inside of a code snippet:
`curl -I https://www.keycdn.com `
Upvotes: 3
Views: 376
Reputation: 626927
You can leverage your existing code that already has a match evaluator as the replacement argument in Regex.Replace.
You need to add an alternative (with the | alternation operator) to the current regex that matches the contexts where you want to ignore matches, and then check which group matched.
The alternative you should add is (?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1 and it matches:
(?<!`) - no backtick is allowed immediately to the left
(`(?:`{2})?) - Group 1: a backtick and then an optional double backtick sequence
(?:(?!\1).)*? - any char (including line break chars, because the sample code uses RegexOptions.Singleline), zero or more occurrences but as few as possible, that does not start the same char sequence that is captured in Group 1
\1 - the same char sequence that is captured in Group 1
See the sample code:
static readonly Regex _urlRegex = new Regex(@"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)", RegexOptions.Singleline);
void UpdateGitHubReadme(string gitHubRepositoryName, string gitHubReadmeText)
{
var updatedMarkdown = _urlRegex.Replace(gitHubReadmeText, x => x.Groups[2].Success ?
HandleRegex(x.Groups[0].Value, gitHubRepositoryName.Replace(".", "").Replace("-", "").ToLower(), "github", gitHubUser.Alias) : x.Value);
//handle updated markdown
}
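To see the group check in action, here is a minimal, self-contained sketch that runs the same pattern over a small Markdown sample. TagUrl is a hypothetical stand-in for HandleRegex (it only wraps the URL in angle brackets), and the sample text and class name are assumptions made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class CodeAwareUrlDemo
{
    // Same pattern as in the sample code above:
    // alternative 1 consumes inline code and fenced code (Group 1 holds the opening backticks),
    // alternative 2 captures a URL into Group 2.
    static readonly Regex UrlRegex = new Regex(
        @"(?<!`)(`(?:`{2})?)(?:(?!\1).)*?\1|((?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?)",
        RegexOptions.Singleline);

    // Hypothetical stand-in for HandleRegex: marks the URL so the change is visible.
    static string TagUrl(string url) => "<" + url + ">";

    // Rewrites only URLs matched by Group 2; code matches are put back unchanged.
    public static string Rewrite(string markdown) =>
        UrlRegex.Replace(markdown, m => m.Groups[2].Success ? TagUrl(m.Value) : m.Value);

    static void Main()
    {
        var fence = new string('`', 3); // a three-backtick code fence
        var markdown = string.Join("\n",
            "Visit https://example.com for docs.",
            "Run `curl -I https://www.keycdn.com` to test.",
            fence,
            "{ \"blog\": \"https://codetraveler.io\" }",
            fence,
            "Also see http://example.org/page.");

        // Only the URLs on the first and last lines should come back wrapped in <...>.
        Console.WriteLine(Rewrite(markdown));
    }
}
```

Because the code alternative comes first and consumes the whole code span or block, the URL alternative never gets a chance to start inside it; the evaluator just has to return x.Value for those matches.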
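I also modified the URL pattern a bit to make it cleaner and more efficient: the needless escapes and capturing groups are gone, (http|ftp|https) is shortened to (?:ht|f)tps? (which also accepts ftps), the + after the scheme group is removed, and the dot-separated host part is now an atomic group, (?>\.[\w-]+)+, to cut down on backtracking. The URL itself is now captured in Group 2.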
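If you want to check how the revised URL pattern behaves next to the one from the question, a rough comparison like the following can help. The sample strings, the class name, and the FirstMatch helper are assumptions for illustration only; the two patterns are copied from the question and from the URL part of the sample code.

```csharp
using System;
using System.Text.RegularExpressions;

static class UrlPatternComparison
{
    // URL pattern from the question, unchanged.
    public static readonly Regex Original = new Regex(
        @"(((http|ftp|https):\/\/)+[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:\/~\+#]*[\w\-\@?^=%&\/~\+#])?)");

    // URL part of the revised pattern, without the code-block alternative.
    public static readonly Regex Revised = new Regex(
        @"(?:ht|f)tps?://[\w-]+(?>\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?");

    // Hypothetical helper: returns the first match or a placeholder.
    public static string FirstMatch(Regex regex, string text)
    {
        var match = regex.Match(text);
        return match.Success ? match.Value : "(no match)";
    }

    static void Main()
    {
        string[] samples =
        {
            "see https://example.com/a?b=1.",
            "ftp://files.example.org/pub",
            "ftps://files.example.org/pub"
        };

        foreach (var sample in samples)
        {
            Console.WriteLine(sample);
            Console.WriteLine("  original: " + FirstMatch(Original, sample));
            Console.WriteLine("  revised:  " + FirstMatch(Revised, sample));
        }
    }
}
```

On these samples both patterns should agree for the http(s) and ftp URLs, and only the revised one should match the ftps URL, so keep the original scheme group if ftps links must stay untouched.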
Upvotes: 2