File with plain text mixed with HTML - split into parts

Question

I need to parse text from some files and split it into parts, depending on whether each part is plain text or HTML.

Let's say this is the example text:

This section should include any considerations for:


    C
    B
    A


h1. Support Contracts

simple par

It should be split like this (I used JSON notation because it was quick to write; the actual container type doesn't matter):

    [{
        part: 1,
        text: "This section should include any considerations for:"
    },
    {
        part: 2,
        text: " C
B
 A"
    },
    {
        part: 3,
        text: "h1. Support Contracts"
    },
    {
        part: 4,
        text: "simple par"
    }]

The HTML is really simple, and all tags are guaranteed to be closed (it is generated by a program).

What is the fastest way to do this (without using any third-party libraries)? Can I use a regex for this task?

pstrjds · Accepted Answer

If I am understanding your requirements properly, I am not sure I would tackle this with a regular expression. It seems like it would be simple enough to just walk the text looking for the tags and building a list of pieces as you go.

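// Requires: using System.Collections.Generic;
var pieces = new List<string>();
int current = 0;
while (current < text.Length)
{
    // Find the start of the next tag.
    var first = text.IndexOf('<', current);
    if (first != -1)
    {
        var second = text.IndexOf('>', first);
        if (second != -1)
        {
            // Tag name between '<' and '>' (assumes the tag has no attributes).
            var tag = text.Substring(first+1, (second-first-1));
            var closeTag = $"</{tag}>";
            var close = text.IndexOf(closeTag, second+1);
            if (close != -1)
            {
                // Move close to the position just past the closing tag.
                close += closeTag.Length;
                if (current < first)
                {
                    // Plain text that precedes the element.
                    pieces.Add(text.Substring(current, (first-current)).Trim());
                }
                // Resume right after the closing tag (close + 1 would skip a character).
                current = close;
                pieces.Add(text.Substring(first, (close-first)).Trim());
            }
            else
            {
                // No matching closing tag: skip past this tag and keep scanning.
                current = second + 1;
            }
        }
        else
        {
            // No '>' after the '<': skip the stray '<'.
            current = first+1;
        }
    }
    else
    {
        // No more tags: the rest is plain text.
        pieces.Add(text.Substring(current).Trim());
        break;
    }
}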
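
For reuse, the same walk could be wrapped in a method. The following is only a sketch under a few assumptions: the names `HtmlTextSplitter`, `Split`, and `AddIfNotEmpty` are hypothetical, tags are assumed to carry no attributes, and elements with the same name are assumed not to nest. Unlike the loop above, it skips pieces that are empty after trimming and keeps an unmatched tag as part of the surrounding plain text.

```csharp
using System;
using System.Collections.Generic;

public static class HtmlTextSplitter
{
    // Hypothetical helper: splits text into plain-text runs and top-level HTML elements.
    public static List<string> Split(string text)
    {
        var pieces = new List<string>();
        int current = 0; // start of the text that has not been emitted yet
        int search = 0;  // position from which to look for the next '<'
        while (search < text.Length)
        {
            int first = text.IndexOf('<', search);
            if (first == -1) break;
            int second = text.IndexOf('>', first);
            if (second == -1) break;

            // Assumes a bare tag name, e.g. "ul" (no attributes).
            string tag = text.Substring(first + 1, second - first - 1);
            string closeTag = "</" + tag + ">";
            int close = text.IndexOf(closeTag, second + 1, StringComparison.Ordinal);
            if (close == -1)
            {
                // No matching close: leave this tag inside the plain text.
                search = second + 1;
                continue;
            }

            close += closeTag.Length; // position just past the closing tag
            AddIfNotEmpty(pieces, text.Substring(current, first - current));
            AddIfNotEmpty(pieces, text.Substring(first, close - first));
            current = search = close;
        }
        // Whatever remains after the last element is plain text.
        AddIfNotEmpty(pieces, text.Substring(current));
        return pieces;
    }

    private static void AddIfNotEmpty(List<string> pieces, string piece)
    {
        string trimmed = piece.Trim();
        if (trimmed.Length > 0) pieces.Add(trimmed);
    }
}
```

Calling `HtmlTextSplitter.Split(text)` on the question's sample (assuming the list was a `<ul>` element, since the original markup is not shown) would give three pieces: the intro sentence, the whole list element, and the trailing plain text as a single piece. Getting "h1. Support Contracts" and "simple par" as separate parts would need an extra step, for example splitting each plain-text piece on blank lines.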
