Reputation: 4691
I am trying to split an HTML string into a Dictionary, where I keep each piece of text along with the HTML element it was wrapped in.
For example, with this input:
var input = "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";
I would like the following output:
{"This is ", "None"},
{"bold", "Bold"},
{" where as ", "None"},
{"this is italic", "Italic"},
{". This is the last sentence", "None"},
I can share my effort, but it's fairly pointless as I can't get it to work, and my approach feels impossible to scale.
internal Dictionary<string, string> SplitTextByHtmlTags(string input)
{
    var result = new Dictionary<string, string>();
    var splitText = new List<string>();
    var split = Split(input, "b");
    foreach (var bold in split)
    {
        var italics = Split(bold, "i");
        splitText.AddRange(italics);
    }
    foreach (var bold in splitText)
    {
        var underlines = Split(bold, "u");
        splitText.AddRange(underlines);
    }
    return result;
}
private IEnumerable<string> Split(string input, string htmlEleName)
{
    return input.Split("<"+htmlEleName+">").Select(s => s.Split("</"+htmlEleName+">")).ToList();
}
As I said, the above does not return the right value nor does it work.
Upvotes: 1
Views: 144
Reputation: 19641
Assuming the input text is always this simple (no nested tags, attributes, comments, etc.), this is fairly easy to achieve using a regular expression. Otherwise, I would stick to using an HTML parser.
Here's a full example:
using System.Collections.Generic;
using System.Text.RegularExpressions;

var result = new List<(string text, string styling)>();
string input =
    "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";
var matches = Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>");
foreach (Match match in matches)
{
    // If neither `<b>` nor `<i>` was found.
    if (!match.Groups[1].Success)
    {
        result.Add((match.Value, "None"));
    }
    else
    {
        string styling = (match.Groups[1].Value == "b" ? "Bold" : "Italic");
        result.Add((match.Groups[2].Value, styling));
    }
}
The example above creates a list of ValueTuples instead of a dictionary (which won't work in this case for reasons mentioned in the comments). The ValueTuple here has two string items. You might consider using an enum instead of a string for the styling.
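The enum suggestion could be sketched roughly as follows. The `TextStyle` enum, the `StyledTextSplitter` class, and its `Split` method are names made up for this illustration. The pattern is the same one used in the example above, so the same limits apply (no nested tags, no attributes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical enum for illustration; it is not part of the original example.
public enum TextStyle { None, Bold, Italic }

public static class StyledTextSplitter
{
    // Splits simple, non-nested markup into (text, style) pairs.
    public static List<(string Text, TextStyle Style)> Split(string input)
    {
        var result = new List<(string Text, TextStyle Style)>();
        foreach (Match match in Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>"))
        {
            if (!match.Groups[1].Success)
            {
                // Plain text outside of any tag.
                result.Add((match.Value, TextStyle.None));
            }
            else
            {
                // Group 1 holds the tag letter, group 2 the tagged text.
                var style = match.Groups[1].Value == "b" ? TextStyle.Bold : TextStyle.Italic;
                result.Add((match.Groups[2].Value, style));
            }
        }
        return result;
    }
}
```

With an enum, a misspelled style name becomes a compile-time error rather than a silent string mismatch, and callers can `switch` over the values.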
Explanation of the Regex pattern:
[^<]+ - Match one or more characters other than '<'.
| - Or.
<([bi])> - Match either 'b' or 'i' enclosed in angle brackets and capture the letter in group 1.
([^<]+) - Match one or more characters other than '<' and capture them in group 2.
</\1> - Match a closing HTML tag (i.e., </..>) with the letter that was captured in group 1.
If you need to support other HTML tags, replace [bi] with something like (?:[biu]|div|span|etc) in the pattern above (or simply use \w+ to support any arbitrary tag). Then, you can have a dictionary that returns the "nice name" for each tag name:
var tags = new Dictionary<string, string>()
{
    { "b", "Bold" },
    { "i", "Italic" },
    { "u", "Underline" },
};
Then, you can use it in the else branch like this:
if (!tags.TryGetValue(match.Groups[1].Value, out string tag))
    tag = match.Groups[1].Value;
result.Add((match.Groups[2].Value, tag));
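Putting the pieces together, a minimal sketch of the generalized version might look like the following. The `TagSplitter` class and its `Split` method are hypothetical names for this illustration. It uses the `\w+` variant of the pattern and still assumes simple input: no nesting, no attributes, and no stray '<' characters.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Tag name -> "nice name". Tags missing here fall back to the raw tag name.
    private static readonly Dictionary<string, string> Tags = new Dictionary<string, string>
    {
        { "b", "Bold" },
        { "i", "Italic" },
        { "u", "Underline" },
    };

    public static List<(string Text, string Styling)> Split(string input)
    {
        var result = new List<(string Text, string Styling)>();
        // \w+ accepts any tag name; \1 requires the closing tag to match it.
        var matches = Regex.Matches(input, @"[^<]+|<(\w+)>([^<]+)</\1>");
        foreach (Match match in matches)
        {
            if (!match.Groups[1].Success)
            {
                // Plain text outside of any tag.
                result.Add((match.Value, "None"));
                continue;
            }
            string name = match.Groups[1].Value;
            if (!Tags.TryGetValue(name, out string styling))
                styling = name;
            result.Add((match.Groups[2].Value, styling));
        }
        return result;
    }
}
```

Calling `TagSplitter.Split(input)` with the string from the question returns the five pieces in document order, which is the ordering a dictionary would not guarantee.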
Upvotes: 2
Reputation: 11
Try something like this:
internal Dictionary<string, string> SplitTextByHtmlTags(string input)
{
    var result = new Dictionary<string, string>();

    // Iterating through a string
    for (var i = 0; i < input.Length; i++)
    {
        // Detecting the opening of a tag (closing tags such as "</b>" are skipped)
        if (input[i] != '<' || i + 1 >= input.Length || input[i + 1] == '/')
            continue;

        string
            tag = "",     // Name of the tag
            content = ""; // Content of the tag

        // Alphabetic characters right after '<' are, most likely, the name of the tag
        int j = i + 1;
        while (j < input.Length && char.IsLetter(input[j]))
            tag += input[j++];

        // Looking for the end of the opening tag
        while (j < input.Length && input[j] != '>')
            j++;

        // Collecting the contents of the tag up to the next '<'
        int l = j + 1;
        while (l < input.Length && input[l] != '<')
            content += input[l++];

        // We put the found values in the map. The indexer overwrites an
        // earlier entry for the same tag, whereas Add would throw.
        if (tag.Length > 0)
            result[tag] = content;

        // We move the "cursor" of the main loop to just before the next '<';
        // the loop increment then steps onto it.
        i = l - 1;
    }
    return result;
}
Upvotes: 0