MyDaftQuestions

Reputation: 4691

How to split a string, keeping order and the reason for the split?

I am trying to split an HTML string into a Dictionary that keeps each piece of text together with the HTML element it came from.

For example, with this input

var input = "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";

I would like the following output

{"This is ", "None"},
{"bold", "Bold"},
{" where as ", "None"},
{"this is italic", "Italic"},
{". This is the last sentence", "None"},

I can share my effort, but it's fairly pointless as I can't get it to work, and my approach feels impossible to scale.

internal Dictionary<string, string> SplitTextByHtmlTags(string input)
{
    var result = new Dictionary<string, string>();

    var splitText = new List<string>();
    var split = Split(input, "b");

    foreach (var bold in split)
    {
        var italics = Split(bold, "i");
        splitText.AddRange(italics);
    }

    foreach (var bold in splitText)
    {
        var underlines= Split(bold, "u");
        splitText.AddRange(underlines);
    }

    return result;
}

private IEnumerable<string> Split(string input, string htmlEleName)
{
    return input.Split("<"+htmlEleName+">").Select(s => s.Split("</"+htmlEleName+">")).ToList();
}

As I said, the above does not return the right value nor does it work.

Upvotes: 1

Views: 144

Answers (2)

41686d6564

Reputation: 19641

Assuming the input text is always this simple (no nested tags, no attributes, no comments, etc.), this is fairly easy to achieve with a regular expression. Otherwise, I would stick to using an HTML parser.

Here's a full example:

var result = new List<(string text, string styling)>();

string input = 
    "This is <b>bold</b> where as <i>this is italic</i>. This is the last sentence";
var matches = Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>");
foreach (Match match in matches)
{
    // If neither `<b>` nor `<i>` was found.
    if (!match.Groups[1].Success)
    {
        result.Add((match.Value, "None"));
    }
    else
    {
        string styling = (match.Groups[1].Value == "b" ? "Bold" : "Italic");
        result.Add((match.Groups[2].Value, styling));
    }
}

The example above creates a list of ValueTuple instead of a dictionary (which won't work in this case: dictionary keys must be unique, so two segments with the same text would collide, and Dictionary<TKey, TValue> does not guarantee enumeration order). The ValueTuple here has two string items. You might consider using an enum instead of a string for the styling.
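
As a rough sketch of the enum suggestion, the same loop could be wrapped in a helper like the one below. The Styling enum, the StyledTextSplitter class and the tuple element names are hypothetical names chosen for illustration; the pattern is the same one used in the example above, with the same assumption of simple, non-nested input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical enum; the names are illustrative only.
public enum Styling { None, Bold, Italic }

public static class StyledTextSplitter
{
    // Plain text, or a <b>/<i> element with no nested tags.
    private static readonly Regex Pattern =
        new Regex(@"[^<]+|<([bi])>([^<]+)</\1>");

    public static List<(string Text, Styling Style)> Split(string input)
    {
        var result = new List<(string Text, Styling Style)>();
        foreach (Match match in Pattern.Matches(input))
        {
            if (!match.Groups[1].Success)
            {
                // Group 1 did not participate, so this is plain text.
                result.Add((match.Value, Styling.None));
            }
            else
            {
                var style = match.Groups[1].Value == "b" ? Styling.Bold : Styling.Italic;
                result.Add((match.Groups[2].Value, style));
            }
        }
        return result;
    }
}
```

With an enum, a misspelled styling name becomes a compile-time error instead of a silent mismatch, and callers can switch on the value.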

Explanation of the Regex pattern:

  • [^<]+ - Match one or more characters other than '<'.
  • | - Or.
  • <([bi])> - Match either 'b' or 'i' enclosed in angle brackets and capture the letter in group 1.
  • ([^<]+) - Match one or more characters other than '<' and capture them in group 2.
  • </\1> - Match a closing HTML tag (i.e., </..>) with the letter that was captured in group 1.
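
To see what each group captures, a small helper along these lines can print every match with groups 1 and 2. PatternDemo and Describe are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Returns one line per match, showing the whole match and groups 1 and 2.
    // A group that did not participate in the match has an empty Value.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Regex.Matches(input, @"[^<]+|<([bi])>([^<]+)</\1>"))
        {
            lines.Add($"match='{m.Value}' g1='{m.Groups[1].Value}' g2='{m.Groups[2].Value}'");
        }
        return lines;
    }
}
```

For the input "a<b>x</b>" this gives match='a' with both groups empty, then match='<b>x</b>' with g1='b' and g2='x'. Note that when the closing tag does not match the opening one (for example "<b>x</i>"), the backreference fails and the leftover characters are picked up as plain-text fragments, which is one more reason to prefer an HTML parser for anything beyond simple input.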

If you need to support other HTML tags, replace [bi] with something like (?:[biu]|div|span|etc) in the pattern above (or simply use \w+ to support any arbitrary tag). Then, you can have a dictionary that returns the "nice name" for each tag name:

var tags = new Dictionary<string, string>()
{
    { "b", "Bold" },
    { "i", "Italic" },
    { "u", "Underline" },
};

Then, you can use it in the else branch like this:

if (!tags.TryGetValue(match.Groups[1].Value, out string tag))
    tag = match.Groups[1].Value;
result.Add((match.Groups[2].Value, tag));
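
Putting the pieces together, a minimal sketch using the \w+ variant of the pattern and the lookup dictionary might look like this. TagSplitter and its member names are hypothetical, and the same limitations apply (no nested tags, no attributes).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagSplitter
{
    // Maps tag names to "nice names"; unknown tags fall back to the raw tag name.
    private static readonly Dictionary<string, string> Tags =
        new Dictionary<string, string>
        {
            { "b", "Bold" },
            { "i", "Italic" },
            { "u", "Underline" },
        };

    public static List<(string Text, string Styling)> Split(string input)
    {
        var result = new List<(string Text, string Styling)>();
        // \w+ accepts any tag name; \1 requires the same name in the closing tag.
        foreach (Match match in Regex.Matches(input, @"[^<]+|<(\w+)>([^<]+)</\1>"))
        {
            if (!match.Groups[1].Success)
            {
                result.Add((match.Value, "None"));
                continue;
            }

            if (!Tags.TryGetValue(match.Groups[1].Value, out string tag))
                tag = match.Groups[1].Value;
            result.Add((match.Groups[2].Value, tag));
        }
        return result;
    }
}
```

For example, TagSplitter.Split("x <u>y</u> <span>z</span>") yields ("x ", "None"), ("y", "Underline"), (" ", "None") and ("z", "span").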

Upvotes: 2

Ertanic

Reputation: 11

Try something like this:

internal List<(string Text, string Tag)> SplitTextByHtmlTags(string input)
{
    //  A list of pairs keeps the order and allows repeated tags,
    //  which a Dictionary keyed by tag name would not.
    var result = new List<(string Text, string Tag)>();

    //  Iterating through the string
    var i = 0;
    while (i < input.Length)
    {
        //  Detecting the opening of a tag
        if (input[i] == '<')
        {
            //  Alphabetic characters right after '<' are the name of the tag
            var j = i + 1;
            while (j < input.Length && char.IsLetter(input[j])) j++;
            var tag = input.Substring(i + 1, j - i - 1);

            //  Looking for the end of the opening tag
            var close = input.IndexOf('>', j);

            //  Looking for the matching closing tag
            var closingTag = "</" + tag + ">";
            var end = close < 0
                ? -1
                : input.IndexOf(closingTag, close + 1, StringComparison.Ordinal);

            if (tag.Length > 0 && end >= 0)
            {
                //  We put the content and the tag name in the list
                result.Add((input.Substring(close + 1, end - close - 1), tag));

                //  We move the "cursor" of the main loop past the closing tag
                i = end + closingTag.Length;
                continue;
            }
        }

        //  Plain text: everything up to the next '<' (or the end of the string)
        var next = input.IndexOf('<', i + 1);
        if (next < 0) next = input.Length;
        result.Add((input.Substring(i, next - i), "None"));
        i = next;
    }

    return result;
}

Upvotes: 0
