Reputation:
Getting a substring of text containing HTML tags
Assume that you want the first 10 characters of the following:
"<p>this is paragraph 1</p>
this is paragraph 2</p>"
The output would be:
"<p>this is"
The returned text contains an unclosed P tag. If this is rendered to a page, subsequent content will be affected by the open P tag. Ideally, the output would close any unclosed HTML tags in the reverse order of their opening:
"<p>this is</p>"
I want a function that returns a substring of HTML and makes sure that no tags are left unclosed.
Upvotes: 4
Views: 7320
Reputation: 7937
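Try this code (Python 3.x):
# Void elements that never get a closing tag
notags = ('img', 'br', 'hr')

def substring2(html, size):
    """Return html cut after `size` text characters, closing any open tags."""
    if len(html) <= size:
        return html
    result, tag, count = '', '', 0
    tags = []       # stack of currently open tag names
    intag = False   # True while scanning between '<' and '>'
    for c in html:
        result += c
        if c == '<':
            intag = True
        elif c == '>' and intag:
            intag = False
            tag = tag.split()[0]
            if tag[0] == '/':
                # closing tag: drop the matching entry from the stack
                tag = tag.replace('/', '')
                if tag not in notags:
                    tags.pop()
            else:
                # opening tag: remember it unless it is self-closing or void
                if tag[-1] != '/' and tag not in notags:
                    tags.append(tag)
            tag = ''
        else:
            if intag:
                tag += c
            else:
                count += 1   # only characters outside tags count toward size
                if count >= size:
                    break
    # close whatever is still open, innermost first
    while len(tags) > 0:
        result += '</{0}>'.format(tags.pop())
    return result

s = '<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>'
print(s)
for size in (30, 40, 55):
    print(substring2(s, size))
Output: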
<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a> language</div>
<div class="main">html <code>substring</code> function writte</div>
<div class="main">html <code>substring</code> function written by <span>imxyl</span></div>
<div class="main">html <code>substring</code> function written by <span>imxylz</span>, using <a href="http://www.python.org">python</a></div>
more
See code at github.
Another question.
Upvotes: 0
Reputation: 21
You can use the following static function. For a working example, see: http://www.koodr.com/item/438c2e9c-62a8-45fc-9ca2-db1479f412e1 . You can also turn this into an extension method.
// requires: using System.Text; using System.Text.RegularExpressions;
public static string HtmlSubstring(string html, int maxlength)
{
    // initialize regular expressions
    string htmltag = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
    string emptytags = "<(\\w+)((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?></\\1>";
    // match all html start and end tags, otherwise get each character one by one
    var expression = new Regex(string.Format("({0})|(.?)", htmltag));
    MatchCollection matches = expression.Matches(html);
    int i = 0;
    StringBuilder content = new StringBuilder();
    foreach (Match match in matches)
    {
        // a single character of text: keep it while under the limit
        if (match.Value.Length == 1 && i < maxlength)
        {
            content.Append(match.Value);
            i++;
        }
        // the match contains a tag
        else if (match.Value.Length > 1)
            content.Append(match.Value);
    }
    // strip the tag pairs that were left empty by the truncation
    return Regex.Replace(content.ToString(), emptytags, string.Empty);
}
Upvotes: 2
Reputation: 12231
You need to teach your code how to understand that your string is actually HTML or XML. Just treating it like a string won't allow you to work with it the way you want to. This means first transforming it to the correct format and then working with that format.
If your HTML is well-formed XML, load it into an XMLDocument and run it through an XSL stylesheet that does something like the following (note that XPath string positions start at 1, not 0):
<xsl:template match="p">
    <xsl:value-of select="substring(text(), 1, 10)" />
</xsl:template>
If it's not well-formed XML (as in your example, where you have a sudden </p> in the middle), you'll need to use an HTML parser of some kind, such as HTML Agility Pack (see this question about C# HTML parsers).
Don't use regular expressions, since HTML is too complex to parse using regex.
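To illustrate the parser-based route, here is a minimal sketch in Python that uses the standard library's html.parser instead of HTML Agility Pack. The names Truncator, html_substring and VOID_TAGS are made up for this example, the size counts text characters only (tags are not counted), and the list of void elements is partial. It has not been tried against badly broken markup, so treat it as a starting point rather than a finished solution.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag (partial list, for illustration only).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Truncator(HTMLParser):
    """Copies markup to `out` until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.limit = limit      # text characters still allowed
        self.out = []           # output fragments
        self.open_tags = []     # stack of currently open elements
        self.done = False       # True once the limit has been reached

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: copy it, nothing to remember.
        if not self.done:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Stray closing tags (like the extra </p> in the question) are ignored.
        if self.done or tag not in self.open_tags:
            return
        # Close everything up to and including the matching element.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        if self.done:
            return
        piece = data[: self.limit]
        # Entities were decoded by the parser, so re-escape the text.
        self.out.append(escape(piece, quote=False))
        self.limit -= len(piece)
        if self.limit <= 0:
            self.done = True


def html_substring(html, size):
    parser = Truncator(size)
    parser.feed(html)
    parser.close()
    # Close whatever is still open, innermost first.
    closing = "".join("</%s>" % t for t in reversed(parser.open_tags))
    return "".join(parser.out) + closing


print(html_substring("<p>this is paragraph 1</p>\nthis is paragraph 2</p>", 7))
# prints: <p>this is</p>
```

Because the parser decodes entities before handing over the text, something like &amp;amp; counts as one character, which a plain character loop would get wrong.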
Upvotes: 3
Reputation: 25775
Your requirement is very unclear so most of this is guesswork. Also, you have provided no code which would help to clarify what it is you want to do.
One solution could be:
a. Find the text between the <p> and the </p> tags. You can use the following Regex for this or use a simple string search:
\<p\>(.*?)\</p\>
b. In the found text, apply a Substring() to extract the required text.
c. Put back the extracted text between the <p> and the </p> tags.
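Steps a to c could be sketched like this in Python (the answer itself talks about Regex and Substring() in .NET terms, so this is a translation, not the original code). The function name truncate_paragraphs is made up, and the sketch assumes plain <p> tags with no attributes and no nested tags inside them.

```python
import re

# Step a: capture the text between <p> and </p> (non-greedy, may span lines).
P_BLOCK = re.compile(r"<p>(.*?)</p>", re.DOTALL)


def truncate_paragraphs(html, size):
    """Shorten the text inside every <p>...</p> pair to at most `size` characters."""
    def shorten(match):
        # Step b: take the substring; step c: put it back between the tags.
        return "<p>" + match.group(1)[:size] + "</p>"
    return P_BLOCK.sub(shorten, html)


print(truncate_paragraphs("<p>this is paragraph 1</p>", 7))
# prints: <p>this is</p>
```

Note that the limit applies to each paragraph separately, and any text outside a matched <p>...</p> pair is left untouched.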
Upvotes: 1
Reputation: 251242
You could loop over the HTML string, detect the angle brackets, and build up an array of tags, recording whether each one has a matching closing tag. The problem is that HTML allows tags with no closing tag, such as img, br, and meta, so you'd need to know about those. You would also need rules to check the order of closing, because just matching an open with a close doesn't make valid HTML: if you open a div, then a p, and then close the div before closing the p, that isn't valid.
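A rough sketch of that idea in Python, using a stack so that the closing order is checked as well. The names first_nesting_error, VOID_TAGS and TAG are made up for this example, the void-element list is partial, and the tag regex assumes that attribute values never contain a '>' character. It also treats every non-void element as requiring a closing tag, which is stricter than HTML itself.

```python
import re

# Tags that are never closed (partial list, for illustration only).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

# Groups: optional leading '/', tag name, optional trailing '/' (self-closing).
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def first_nesting_error(html):
    """Return a description of the first nesting problem, or None if tags balance."""
    stack = []  # names of the elements that are currently open
    for closing, name, self_closing in TAG.findall(html):
        name = name.lower()
        if name in VOID_TAGS or self_closing:
            continue
        if not closing:
            stack.append(name)
        elif not stack:
            return "unexpected </%s>" % name
        elif stack[-1] != name:
            # Closed in the wrong order, e.g. <div><p></div></p>.
            return "expected </%s> but found </%s>" % (stack[-1], name)
        else:
            stack.pop()
    if stack:
        return "unclosed <%s>" % stack[-1]
    return None


print(first_nesting_error("<div><p>text</div></p>"))
# prints: expected </p> but found </div>
```

Whatever is left on the stack at the end is exactly the list of tags a truncation function would have to close, innermost first.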
Upvotes: 0