How to split a string, keeping order and the reason for the split?

Question

I am trying to split an HTML string into a Dictionary that keeps each piece of text together with the HTML element it came from.

For example, with this input

var input = "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";

I would like the following output

{"This is ", "None"},
{"bold", "Bold"},
{" where as ", "None"},
{"this is italic", "italic"},
{". This is the last sentence", "None"},

I can share my effort, but it's fairly pointless as I can't get it to work, and my approach feels impossible to scale.

internal Dictionary<string, string> SplitTextByHtmlTags(string input)
{
    var result = new Dictionary<string, string>();

    var splitText = new List<string>();
    var split = Split(input, "b");

    foreach (var bold in split)
    {
        var italics = Split(bold, "i");
        splitText.AddRange(italics);
    }

    foreach (var bold in splitText)
    {
        var underlines= Split(bold, "u");
        splitText.AddRange(underlines);
    }

    return result;
}

private IEnumerable<string> Split(string input, string htmlEleName)
{
    return input.Split("<"+htmlEleName+">").Select(s => s.Split("</"+htmlEleName+">")).ToList();
}

As I said, the above does not return the right value nor does it work.

41686d6564 · Accepted Answer

Assuming the input text is always this simple (no nested tags, no attributes, no comments, etc.), this is fairly easy to achieve with a regular expression. Otherwise, I would stick to using an HTML parser.

Here's a full example:

using System.Collections.Generic;
using System.Text.RegularExpressions;

var result = new List<(string text, string styling)>();

string input = 
    "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";
var matches = Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>");
foreach (Match match in matches)
{
    // If neither `<b>` nor `<i>` was found.
    if (!match.Groups[1].Success)
    {
        result.Add((match.Value, "None"));
    }
    else
    {
        string styling = (match.Groups[1].Value == "b" ? "Bold" : "Italic");
        result.Add((match.Groups[2].Value, styling));
    }
}

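The example above creates a list of ValueTuple values instead of a dictionary (which won't work in this case for reasons mentioned in the comments). In particular, a Dictionary requires unique keys and does not guarantee enumeration order, so repeated text or the original ordering could be lost. The ValueTuple here has two string items. You might consider using an enum instead of a string for the styling.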
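As a rough illustration of the enum suggestion, here is a minimal sketch. The names `Styling`, `EnumSplitter`, and `Split` are hypothetical and not part of the original answer; the pattern is the same one used above and carries the same assumptions (no nesting, no attributes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical enum introduced for illustration only.
public enum Styling { None, Bold, Italic }

public static class EnumSplitter
{
    // Splits simple HTML into ordered (text, style) pairs.
    // Assumes only flat <b>...</b> and <i>...</i> tags appear in the input.
    public static List<(string Text, Styling Style)> Split(string input)
    {
        var result = new List<(string Text, Styling Style)>();
        foreach (Match match in Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>"))
        {
            if (!match.Groups[1].Success)
            {
                // Plain text outside any tag.
                result.Add((match.Value, Styling.None));
            }
            else
            {
                Styling style = match.Groups[1].Value == "b" ? Styling.Bold : Styling.Italic;
                result.Add((match.Groups[2].Value, style));
            }
        }
        return result;
    }
}
```

With an enum, a typo such as "Itallic" becomes a compile-time error rather than a silent mismatch, which is the main reason to prefer it over strings here.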

Explanation of the Regex pattern:

[^<]+ - Match one or more characters other than '<'.

| - Or.

<([bi])> - Match either 'b' or 'i' enclosed in angle brackets and capture the letter in group 1.

([^<]+) - Match one or more characters other than '<' and capture them in group 2.

</\1> - Match a closing HTML tag (i.e., </b> or </i>) with the letter that was captured in group 1.
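To make the role of the two capture groups and the backreference more concrete, the following sketch inspects only the first match of an input. `PatternDemo` and `FirstSegment` are hypothetical names used for illustration; an empty tag letter stands for "matched by the plain-text alternative".

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    public const string Pattern = @"[^<]+|<([bi])>([^<]+)</\1>";

    // Returns the tag letter (group 1) and inner text (group 2) of the first match.
    // If the plain-text alternative matched, the tag letter is empty and the
    // whole match is returned as the text.
    public static (string Tag, string Text) FirstSegment(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Groups[1].Success
            ? (m.Groups[1].Value, m.Groups[2].Value)
            : ("", m.Value);
    }
}
```

For "<i>this is italic</i>" this yields the tag letter "i" and the text "this is italic". For a mismatched pair such as "<b>oops</i>", the backreference fails, so the tagged alternative does not match and the engine falls back to matching the fragment "b>oops" as plain text, which is one reason the pattern is only suitable for well-formed, simple input.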

If you need to support other HTML tags, replace [bi] with something like (?:[biu]|div|span|etc) in the pattern above (or simply use \w+ to support any arbitrary tag). Then, you can have a dictionary that returns the "nice name" for each tag name:

var tags = new Dictionary<string, string>()
{
    { "b", "Bold" },
    { "i", "Italic" },
    { "u", "Underline" },
};

Then, you can use it in the else branch like this:

if (!tags.TryGetValue(match.Groups[1].Value, out string tag))
    tag = match.Groups[1].Value;
result.Add((match.Groups[2].Value, tag));
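Putting the pieces together, a generalized version might look like the sketch below. It assumes the `\w+` variant of the pattern and falls back to the raw tag name when no "nice name" is registered; `TagSplitter` and its members are hypothetical names, and the same limitations apply (no nesting, no attributes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Maps tag names to display names; unknown tags fall back to the tag name itself.
    private static readonly Dictionary<string, string> Tags = new Dictionary<string, string>
    {
        { "b", "Bold" },
        { "i", "Italic" },
        { "u", "Underline" },
    };

    public static List<(string Text, string Styling)> Split(string input)
    {
        var result = new List<(string Text, string Styling)>();
        foreach (Match match in Regex.Matches(input, @"[^<]+|<(\w+)>([^<]+)</\1>"))
        {
            if (!match.Groups[1].Success)
            {
                // Plain text outside any tag.
                result.Add((match.Value, "None"));
                continue;
            }

            string name = match.Groups[1].Value;
            if (!Tags.TryGetValue(name, out string tag))
                tag = name;
            result.Add((match.Groups[2].Value, tag));
        }
        return result;
    }
}
```

Calling `TagSplitter.Split("a<u>b</u><span>c</span>")` should produce three entries in order: plain "a", "b" as Underline, and "c" labelled with the raw tag name "span".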
