Shameem

Getting a substring of text containing HTML tags

Assume that you want the first 10 characters of the following:

"<p>this is paragraph 1</p>

this is paragraph 2</p>"

The output would be:

"<p>this is"

The returned text contains an unclosed P tag. If this is rendered to a page, subsequent content will be affected by the open P tag. Ideally, the output would close any unclosed HTML tags, in the reverse of the order in which they were opened:

"<p>this is</p>"

I want a function that returns a substring of HTML, making sure that no tags are left unclosed.

Upvotes: 4

Views: 7320

Answers (5)

imxylz

Reputation: 7937

Try this code (Python 3.x). It counts only text characters toward the limit, not the characters inside tags:

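notags = ('img', 'br', 'hr')  # void elements that never get a closing tag

def substring2(html, size):
    """Return html cut after `size` text characters, closing any open tags."""
    if len(html) <= size:
        return html
    result, tag, count = '', '', 0
    intag = False  # must be initialised in case html starts with plain text
    tags = []      # stack of currently open tags
    for c in html:
        result += c
        if c == '<':
            intag = True
        elif c == '>':
            intag = False
            tag = tag.split()[0]  # drop attributes, keep the tag name
            if tag[0] == '/':
                tag = tag.replace('/', '')
                # guard against a stray closing tag with nothing open
                if tag not in notags and tags:
                    tags.pop()
            elif tag[-1] != '/' and tag not in notags:
                tags.append(tag)
            tag = ''
        elif intag:
            tag += c
        else:
            count += 1
            if count >= size:
                break
    # close whatever is still open, innermost first
    while tags:
        result += '</{0}>'.format(tags.pop())
    return result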

s='<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>'
print(s)
for size in (30,40,55):
    print(substring2(s,size))

Output:

<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>
<div class="main">html <code>substring</code> function writte</div>
<div class="main">html <code>substring</code> function written by <span>imxyl</span></div>
<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a></div>

more

See code at github.

Another question.

Upvotes: 0

Chuhukon

Reputation: 21

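You can use the following static function. For a working example, check: http://www.koodr.com/item/438c2e9c-62a8-45fc-9ca2-db1479f412e1 . You can also turn this into an extension method. Note that it keeps every tag and only drops text beyond the limit, then strips the element pairs that end up empty.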

using System.Text;
using System.Text.RegularExpressions;

public static string HtmlSubstring(string html, int maxlength)
{
    // Initialize regular expressions.
    string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";

    // Match all HTML start and end tags; otherwise take each character one by one.
    var expression = new Regex(string.Format("({0})|(.?)", htmltag));
    MatchCollection matches = expression.Matches(html);

    int i = 0;
    StringBuilder content = new StringBuilder();
    foreach (Match match in matches)
    {
        if (match.Value.Length == 1 && i < maxlength)
        {
            // A single text character: keep it while under the limit.
            content.Append(match.Value);
            i++;
        }
        else if (match.Value.Length > 1)
        {
            // The match contains a tag: always keep it so the markup stays balanced.
            content.Append(match.Value);
        }
    }

    // Remove element pairs left empty once their text was cut.
    return Regex.Replace(content.ToString(), emptytags, string.Empty);
}

Upvotes: 2

Rahul

Reputation: 12231

You need to teach your code how to understand that your string is actually HTML or XML. Just treating it like a string won't allow you to work with it the way you want to. This means first transforming it to the correct format and then working with that format.

Use an XSL stylesheet

If your HTML is well-formed XML, load it into an XmlDocument and run it through an XSL stylesheet that does something like the following (XPath string positions start at 1):

<xsl:template match="p">
  <xsl:value-of select="substring(text(), 1, 10)" />
</xsl:template>

Use an HTML parser

If it's not well-formed XML (as in your example, where you have a sudden </p> in the middle), you'll need to use an HTML parser of some kind, such as HTML Agility Pack (see this question about C# HTML parsers).

Don't use regular expressions, since HTML is too complex to parse using regex.
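
As an illustration of the parser-based approach, here is a minimal sketch in Python using the standard library's html.parser rather than HTML Agility Pack. The names Truncator and truncate_html are made up for this example, and it assumes the limit applies to text characters only (tags are not counted). Comments and doctype declarations are dropped, and script/style content is not treated specially, so take it as a starting point rather than a finished solution.

```python
import html
from html.parser import HTMLParser

# HTML void elements: they never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Truncator(HTMLParser):
    """Hypothetical helper: copies markup until `limit` text characters are seen."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit
        self.count = 0          # text characters emitted so far
        self.out = []           # pieces of the output markup
        self.open_tags = []     # stack of elements still open
        self.done = limit <= 0

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())  # original tag text, attributes included
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: emit it, nothing to track.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Ignore stray closing tags that match nothing on the stack.
        if self.done or tag not in self.open_tags:
            return
        # Close everything opened after the matching start tag, innermost first.
        while True:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.limit - self.count]
        self.out.append(html.escape(piece, quote=False))  # re-escape decoded text
        self.count += len(piece)
        self.done = self.count >= self.limit


def truncate_html(markup, limit):
    parser = Truncator(limit)
    parser.feed(markup)
    parser.close()
    # Close whatever is still open, in reverse order of opening.
    closers = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + closers


print(truncate_html("<p>this is paragraph 1</p>", 7))  # <p>this is</p>
```

Because the parser keeps a stack of open elements, closing them in reverse order at the end is a one-liner, which is the part that plain string slicing cannot do.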

Upvotes: 3

Cerebrus

Reputation: 25775

Your requirement is very unclear so most of this is guesswork. Also, you have provided no code which would help to clarify what it is you want to do.

One solution could be:

a. Find the text between the <p> and the </p> tags. You can use the following Regex for this or use a simple string search:

\<p\>(.*?)\</p\>

b. In the found text, apply a Substring() to extract the required text.

c. Put back the extracted text between the <p> and the </p> tags.
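
Steps a to c can be sketched as follows in Python. The function name truncate_paragraphs is invented for this example, and it assumes plain <p> tags without attributes and without nested markup inside them; anything else needs a real parser.

```python
import re

# Step a: find the text between <p> and </p> (non-greedy, may span lines).
P_BLOCK = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def truncate_paragraphs(markup, size):
    """Shorten the text of every <p>...</p> pair to at most `size` characters."""
    def shorten(match):
        # Step b: take the substring; step c: put it back between the tags.
        return "<p>" + match.group(1)[:size] + "</p>"
    return P_BLOCK.sub(shorten, markup)


print(truncate_paragraphs("<p>this is paragraph 1</p>", 7))  # <p>this is</p>
```

Text outside a complete <p>...</p> pair is left untouched, so the second, malformed paragraph in the question's example would pass through unchanged.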

Upvotes: 1

Fenton

Reputation: 251242

You could loop over the HTML string, detect the angle brackets, and build up an array of tags, recording whether each one has a matching closing tag. The problem is that HTML allows non-closing tags, such as img, br and meta, so you'd need to know about those. You would also need rules to check the order of closing, because just matching an open with a close doesn't make valid HTML: if you open a div, then a p, then close the div and then close the p, that isn't valid.
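
A rough sketch of that idea in Python, using a stack so that the closing order is checked as well. The names unclosed_tags and close_open_tags are invented here, the tag regex is a simplification (it does not cope with ">" inside attribute values), and a fragment that was cut in the middle of a tag is not handled.

```python
import re

# HTML void elements: they never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

# Groups: optional leading "/", tag name, optional trailing "/" (self-closing).
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>")


def unclosed_tags(markup):
    """Return the stack of still-open tag names; raise ValueError on bad nesting."""
    stack = []
    for closing, name, self_closing in TAG_RE.findall(markup):
        name = name.lower()
        if name in VOID_TAGS or self_closing:
            continue                      # nothing to close later
        if not closing:
            stack.append(name)            # opening tag
        elif stack and stack[-1] == name:
            stack.pop()                   # closes the innermost open tag
        else:
            raise ValueError("unexpected </%s>" % name)
    return stack


def close_open_tags(fragment):
    """Append closing tags for everything still open, innermost first."""
    return fragment + "".join("</%s>" % t for t in reversed(unclosed_tags(fragment)))


print(close_open_tags("<p>this is"))  # <p>this is</p>
```

With this check, the div/p example above ("<div><p></div></p>") raises ValueError, because </div> arrives while p is still the innermost open tag.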

Upvotes: 0
