Reputation: 6255
This problem is a challenging one. Our application allows users to post news on the homepage. That news is input via a rich text editor which allows HTML. On the homepage we want to only display a truncated summary of the news item.
For example, here is the full text we are displaying, including HTML:
In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us.
We want to trim the news item to 250 characters, but exclude HTML.
The method we currently use for trimming counts the HTML as well, so news posts that are HTML heavy end up truncated far more than intended.
For instance, if the above example included tons of HTML, it could potentially look like this:
In an attempt to make a bit more space in the office, kitchen, I've pulled...
This is not what we want.
Does anyone have a way of tokenizing HTML tags in order to maintain position in the string, perform a length check and/or trim on the string, and restore the HTML inside the string at its old location?
Upvotes: 9
Views: 8407
Reputation: 841
You can try the following npm package
It cuts off the text inside the HTML tags once the limit is reached, preserves the original HTML structure, removes any tags that come after the limit, and closes the tags that are still open.
Upvotes: 1
Reputation: 81
Following the two-state finite state machine suggestion, I've just developed a simple HTML parser for this purpose, in Java:
And here is a test case:
And here is the Java code:
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
public class HtmlShortener {
private static final String TAGS_TO_SKIP = "br,hr,img,link";
private static final String[] tagsToSkip = TAGS_TO_SKIP.split(",");
private static final int STATUS_READY = 0;
private int cutPoint = -1;
private String htmlString = "";
final List<String> tags = new LinkedList<String>();
StringBuilder sb = new StringBuilder("");
StringBuilder tagSb = new StringBuilder("");
int charCount = 0;
int status = STATUS_READY;
public HtmlShortener(String htmlString, int cutPoint){
this.cutPoint = cutPoint;
this.htmlString = htmlString;
}
public String cut(){
// reset
tags.clear();
sb = new StringBuilder("");
tagSb = new StringBuilder("");
charCount = 0;
status = STATUS_READY;
String tag = "";
if (cutPoint < 0){
return htmlString;
}
if (null != htmlString){
if (cutPoint == 0){
return "";
}
for (int i = 0; i < htmlString.length(); i++){
String strC = htmlString.substring(i, i+1);
if (strC.equals("<")){
// new tag or tag closure
// previous tag reset
tagSb = new StringBuilder("");
tag = "";
// find tag type and name
for (int k = i; k < htmlString.length(); k++){
String tagC = htmlString.substring(k, k+1);
tagSb.append(tagC);
if (tagC.equals(">")){
tag = getTag(tagSb.toString());
if (tag.startsWith("/")){
// closure
// guard against a closing tag with no matching open tag (malformed HTML)
if (!isToSkip(tag) && !tags.isEmpty()){
sb.append("</").append(tags.get(tags.size() - 1)).append(">");
tags.remove(tags.size() - 1);
}
} else {
// new tag
sb.append(tagSb.toString());
if (!isToSkip(tag)){
tags.add(tag);
}
}
i = k;
break;
}
}
} else {
sb.append(strC);
charCount++;
}
// cut check
if (charCount >= cutPoint){
// close previously open tags
Collections.reverse(tags);
for (String t : tags){
sb.append("</").append(t).append(">");
}
break;
}
}
return sb.toString();
} else {
return null;
}
}
private boolean isToSkip(String tag) {
if (tag.startsWith("/")){
tag = tag.substring(1, tag.length());
}
for (String tagToSkip : tagsToSkip){
if (tagToSkip.equals(tag)){
return true;
}
}
return false;
}
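private String getTag(String tagString) {
// strip the angle brackets
String name = tagString.substring(tagString.indexOf("<") + 1, tagString.indexOf(">")).trim();
// drop the trailing "/" of a self-closing tag, so that <br/> yields "br" and is skipped
if (name.endsWith("/")){
name = name.substring(0, name.length() - 1).trim();
}
// drop the attributes, if any
int space = name.indexOf(" ");
return space < 0 ? name : name.substring(0, space);
}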
}
Upvotes: 2
Reputation: 2441
I'm aware this is quite a bit after the posted date, but I had a similar issue and this is how I ended up solving it. My concern would be the speed of regex versus iterating through an array.
Also, if there is a space before an HTML tag and another after it, this doesn't handle that.
private string HtmlTrimmer(string input, int len)
{
if (string.IsNullOrEmpty(input))
return string.Empty;
if (input.Length <= len)
return input;
// this is necessary because regex "^" applies to the start of the string, not where you tell it to start from
string inputCopy;
string tag;
string result = "";
int strLen = 0;
int strMarker = 0;
int inputLength = input.Length;
Stack stack = new Stack(10);
Regex text = new Regex("^[^<&]+");
Regex singleUseTag = new Regex("^<[^>]*?/>");
Regex specChar = new Regex("^&[^;]*?;");
Regex htmlTag = new Regex("^<.*?>");
while (strLen < len)
{
inputCopy = input.Substring(strMarker);
//If the marker is at the end of the string OR
//the sum of the remaining characters and those analyzed is less than the maxlength
if (strMarker >= inputLength || (inputLength - strMarker) + strLen < len)
break;
//Match regular text
result += text.Match(inputCopy,0,len-strLen);
strLen += result.Length - strMarker;
strMarker = result.Length;
inputCopy = input.Substring(strMarker);
if (singleUseTag.IsMatch(inputCopy))
result += singleUseTag.Match(inputCopy);
else if (specChar.IsMatch(inputCopy))
{
//think of as 1 character instead of 5
result += specChar.Match(inputCopy);
++strLen;
}
else if (htmlTag.IsMatch(inputCopy))
{
tag = htmlTag.Match(inputCopy).ToString();
//This only works if this is valid Markup...
if(tag[1]=='/') //Closing tag
stack.Pop();
else //not a closing tag
stack.Push(tag);
result += tag;
}
else //Bad syntax
result += input[strMarker];
strMarker = result.Length;
}
while (stack.Count > 0)
{
tag = stack.Pop().ToString();
result += tag.Insert(1, "/");
}
if (strLen == len)
result += "...";
return result;
}
Upvotes: 0
Reputation: 6255
Here's the implementation that I came up with, in C#:
public static string TrimToLength(string input, int length)
{
if (string.IsNullOrEmpty(input))
return string.Empty;
if (input.Length <= length)
return input;
bool inTag = false;
int targetLength = 0;
for (int i = 0; i < input.Length; i++)
{
char c = input[i];
if (c == '>')
{
inTag = false;
continue;
}
if (c == '<')
{
inTag = true;
continue;
}
if (inTag || char.IsWhiteSpace(c))
{
continue;
}
targetLength++;
if (targetLength == length)
{
return ConvertToXhtml(input.Substring(0, i + 1));
}
}
return input;
}
And a few unit tests I used via TDD:
[Test]
public void Html_TrimReturnsEmptyStringWhenNullPassed()
{
Assert.That(Html.TrimToLength(null, 1000), Is.Empty);
}
[Test]
public void Html_TrimReturnsEmptyStringWhenEmptyPassed()
{
Assert.That(Html.TrimToLength(string.Empty, 1000), Is.Empty);
}
[Test]
public void Html_TrimReturnsUnmodifiedStringWhenSameAsLength()
{
string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I";
Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(source));
}
[Test]
public void Html_TrimWellFormedHtml()
{
string source = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
"In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>" +
"</div>";
string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
Assert.That(Html.TrimToLength(source, 250), Is.EqualTo(expected));
}
[Test]
public void Html_TrimMalformedHtml()
{
string malformedHtml = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in a box and donated to an office in more need of mugs than us. <br/><br/>" +
"In the meantime we have a nice selection of white Ikea mugs, some random Starbucks mugs, and others that have made their way into the office over the years. Hopefully that will suffice. <br/><br/>";
string expected = "<div lang=\"en\" class=\"textBody localizable\" id=\"pageBody_en\">" +
"<img photoid=\"4041\" src=\"http://xxxxxxxx/imagethumb/562103830000/4041/300x300/False/mugs.jpg\" style=\"float: right;\" class=\"photoRight\" alt=\"\"/>" +
"<br/>" +
"In an attempt to make a bit more space in the office, kitchen, I've pulled out all of the random mugs and put them onto the lunch room table. Unless you feel strongly about the ownership of that Cheyenne Courier mug from 1992 or perhaps that BC Tel Advanced Communications mug from 1997, they will be put in";
Assert.That(Html.TrimToLength(malformedHtml, 250), Is.EqualTo(expected));
}
Upvotes: 0
Reputation: 347226
If I understand the problem correctly, you want to keep the HTML formatting, but you want to not count it as part of the length of the string you are keeping.
You can accomplish this with code that implements a simple finite state machine.
2 states: InTag, OutOfTag
InTag:
- Goes to OutOfTag if a > character is encountered
- Goes to itself if any other character is encountered
OutOfTag:
- Goes to InTag if a < character is encountered
- Goes to itself if any other character is encountered
Your starting state will be OutOfTag.
You implement a finite state machine by processing 1 character at a time. The processing of each character brings you to a new state.
As you run your text through the finite state machine, you also want to keep an output buffer and a variable holding the length encountered so far (so you know when to stop).
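As a rough illustration of the two states described above, here is a minimal Java sketch. The class name TagStateMachine and the parameter maxVisible are made up for the example, and it assumes that every < starts a tag and every > ends one. It does not close tags that are still open at the cut point.

```java
public class TagStateMachine {
    private enum State { IN_TAG, OUT_OF_TAG }

    /**
     * Copies html into an output buffer until maxVisible characters
     * outside of tags have been seen. Open tags are NOT closed here.
     */
    public static String truncate(String html, int maxVisible) {
        StringBuilder out = new StringBuilder(); // output buffer
        State state = State.OUT_OF_TAG;          // starting state
        int visible = 0;                         // length so far, tags excluded

        for (int i = 0; i < html.length() && visible < maxVisible; i++) {
            char c = html.charAt(i);
            switch (state) {
                case OUT_OF_TAG:
                    if (c == '<') {
                        state = State.IN_TAG;     // a tag starts
                    } else {
                        visible++;                // ordinary text counts
                    }
                    break;
                case IN_TAG:
                    if (c == '>') {
                        state = State.OUT_OF_TAG; // the tag ends
                    }
                    break;
            }
            out.append(c); // tag characters are copied but never counted
        }
        return out.toString();
    }
}
```

For example, TagStateMachine.truncate("<b>Hello</b> world", 3) stops after the third visible character and leaves the b element unclosed, so the result still needs a clean-up pass.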
Upvotes: 0
Reputation: 74558
Start at the first character of the post, stepping over each character. Every time you step over a character, increment a counter. When you find a '<' character, stop incrementing the counter until you hit a '>' character. Your position when the counter gets to 250 is where you actually want to cut off.
Take note that this will have another problem that you'll have to deal with when an HTML tag is opened but not closed before the cutoff.
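A sketch of the counting idea in Java, together with one possible way to deal with tags left open at the cutoff, could look like this. The names CutPointFinder, findCutIndex and cutAndClose are invented for the example, the list of tags that never need closing is an assumption, and the tag regex is only an approximation of real HTML parsing.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CutPointFinder {
    // Assumed list of elements that never take a closing tag.
    private static final List<String> VOID_TAGS =
            Arrays.asList("br", "hr", "img", "link", "input", "meta");
    // group(1): "/" for a closing tag, group(2): tag name, group(3): "/" if self-closing
    private static final Pattern TAG =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>");

    /** Index just past the limit-th character outside tags, or html.length(). */
    public static int findCutIndex(String html, int limit) {
        if (limit <= 0) {
            return 0;
        }
        int counter = 0;
        boolean counting = true; // false while between '<' and '>'
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                counting = false;
            } else if (c == '>') {
                counting = true;
            } else if (counting && ++counter == limit) {
                return i + 1;
            }
        }
        return html.length();
    }

    /** Cuts at the counted position and closes the tags still open there. */
    public static String cutAndClose(String html, int limit) {
        String head = html.substring(0, findCutIndex(html, limit));
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(head);
        while (m.find()) {
            String name = m.group(2).toLowerCase();
            if (!m.group(1).isEmpty()) {
                open.removeFirstOccurrence(name); // closing tag: drop its opener
            } else if (m.group(3).isEmpty() && !VOID_TAGS.contains(name)) {
                open.push(name);                  // tag that stays open
            }
        }
        StringBuilder out = new StringBuilder(head);
        for (String name : open) {                // most recently opened first
            out.append("</").append(name).append('>');
        }
        return out.toString();
    }
}
```

Because the cut index is only ever returned on a counted character, the cut never lands in the middle of a tag. A literal > inside the text would confuse the counter, so this is only suitable for markup where such characters are escaped.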
Upvotes: 10
Reputation: 16848
Wouldn't the fastest way be to use jQuery's text() method?
For example:
<ul>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ul>
var text = $('ul').text();
Would give the value OneTwoThree in the text variable. This would allow you to get the actual length of the text without the HTML included.
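If the trimming has to happen on the server instead of in the browser, the same idea (measure the text with the markup taken out) can be approximated in Java. This is only a sketch: the class and method names are made up, the regex simply drops anything between < and >, and entities such as &amp;amp; are not decoded, so it is not a substitute for a real HTML parser.

```java
public class VisibleText {
    /** Removes anything that looks like a tag and returns the remaining text. */
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    /** Length of the text once the tags have been removed. */
    public static int visibleLength(String html) {
        return stripTags(html).length();
    }
}
```

Like text(), this tells you how long the visible text is but not where to cut the original markup, so it is mostly useful for deciding whether a post needs trimming at all.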
Upvotes: -1