Reputation: 3762
I process a lot of HTML and convert it to PDF files. Before I can convert the HTML, I have to detect whether any of the images reference an external file. If an image does, I base64-encode the file and replace the src value with the resulting data URI.
Right now I rely on a Regex for the detection, but since I already use HtmlAgilityPack, I was wondering whether I can achieve the same with HtmlAgilityPack.
That way I would not have to maintain the Regex alongside HtmlAgilityPack.
This is how I currently detect a data URI with the Regex:
using System;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

void Main()
{
    var myHtml = @"<html><head></head><body><p><img src=''/></p></body></html>";
    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(myHtml);

    // SelectNodes returns null when nothing matches, hence the null check.
    var imgs = htmlDoc.DocumentNode.SelectNodes("//img");
    if (imgs != null && imgs.Count > 0)
    {
        foreach (var imgNode in imgs)
        {
            var srcAttribute = imgNode.Attributes.FirstOrDefault(a => string.Equals("src", a.Name, StringComparison.InvariantCultureIgnoreCase));

            // Only a non-empty src that is not already a data URI needs encoding.
            if (!string.IsNullOrEmpty(srcAttribute?.Value) && !StringIsDataUri(srcAttribute.Value))
            {
                Console.WriteLine("BASE ENCODE THE REFERENCED FILE");
            }
        }
    }
}

// Regex from http://stackoverflow.com/a/5714355/1958344
private static readonly Regex regex = new Regex(@"data:(?<mime>[\w/\-\.]+);(?<encoding>\w+),(?<data>.*)", RegexOptions.Compiled);

private bool StringIsDataUri(string stringToTest)
{
    return regex.IsMatch(stringToTest);
}
Upvotes: 2
Views: 1609
Reputation: 89305
HtmlAgilityPack doesn't have a built-in function to detect a data URI, so you still need to supply your own implementation of that check.
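If the only goal is to tell a data URI apart from a file reference, a plain prefix check may be enough in place of the regex. The following is a minimal sketch under that assumption, not an HtmlAgilityPack feature: DataUriCheck and IsDataUri are made-up names. Note that it behaves differently from the regex in the question: it only matches at the start of the value, and it also accepts data URIs that have no ;encoding part, such as data:text/plain,hello.

```csharp
using System;

public static class DataUriCheck
{
    // Hypothetical helper (not part of HtmlAgilityPack): treats a value as a
    // data URI when it starts with the "data:" scheme, ignoring case and
    // leading whitespace. Null, empty, or blank values are not data URIs.
    public static bool IsDataUri(string src)
    {
        if (string.IsNullOrWhiteSpace(src))
        {
            return false;
        }

        return src.TrimStart().StartsWith("data:", StringComparison.OrdinalIgnoreCase);
    }
}
```

It could stand in for StringIsDataUri wherever the src value is tested, if the looser matching is acceptable for your input.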
As an aside, you can use the LINQ API of HtmlAgilityPack to select only the img elements whose src attribute is a file reference in the first place:
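var referenceImgs = htmlDoc.DocumentNode
    .Descendants("img")
    .Where(o =>
    {
        // Skip images without a src, as the original loop does,
        // and images whose src already holds a data URI.
        var src = o.GetAttributeValue("src", "");
        return !string.IsNullOrEmpty(src) && !StringIsDataUri(src);
    });

foreach (HtmlNode img in referenceImgs)
{
    Console.WriteLine("BASE ENCODE THE REFERENCED FILE");
}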
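For the step that the Console.WriteLine placeholder stands for, something along these lines might work. It is only a sketch: DataUriBuilder, GuessMimeType, ToDataUri and FileToDataUri are hypothetical names, the extension-to-MIME table is an assumption you would need to extend, and it assumes the src has already been resolved to a readable local file path.

```csharp
using System;
using System.IO;

public static class DataUriBuilder
{
    // Hypothetical lookup from file extension to MIME type; unknown
    // extensions fall back to a generic binary type.
    public static string GuessMimeType(string path)
    {
        switch (Path.GetExtension(path).ToLowerInvariant())
        {
            case ".png": return "image/png";
            case ".jpg":
            case ".jpeg": return "image/jpeg";
            case ".gif": return "image/gif";
            case ".svg": return "image/svg+xml";
            default: return "application/octet-stream";
        }
    }

    // Builds "data:<mime>;base64,<payload>" from raw bytes.
    public static string ToDataUri(byte[] bytes, string mimeType)
    {
        return "data:" + mimeType + ";base64," + Convert.ToBase64String(bytes);
    }

    // Reads the file from disk and encodes its full content as a data URI.
    public static string FileToDataUri(string path)
    {
        return ToDataUri(File.ReadAllBytes(path), GuessMimeType(path));
    }
}
```

Inside the loop, the result could then be written back with img.SetAttributeValue("src", DataUriBuilder.FileToDataUri(localPath)), where localPath is whatever file path you resolve the original src to.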
Upvotes: 3