Reputation: 12916
I have C# strings which contain sentences. Sometimes these sentences are OK, sometimes they are just user-generated random characters. What I would like to do is to trim the words inside these sentences. For example, given the following string:
var stringWithLongWords = "Here's a text with tooooooooooooo long words";
I would like to run this through a filter:
var trimmed = TrimLongWords(stringWithLongWords, 6);
And to get an output where every word can contain only up to 6 characters:
"Here's a text with tooooo long words"
Any ideas how this could be done with good performance? Is there anything in .NET which could handle this automatically?
I'm currently using the following code:
private static string TrimLongWords(string original, int maxCount)
{
    return string.Join(" ", original.Split(' ').Select(x => x.Substring(0, x.Length > maxCount ? maxCount : x.Length)));
}
This works in theory, but it produces bad output if the long word ends with a separator other than a space. For example:
This is sweeeeeeeeeeeeeeeet! And something more.
Ends up looking like this:
This is sweeeeeeee And something more.
Update:
OK, the comments were so good that I realized that this may have too many "what ifs". Perhaps it would be better if the separators are forgotten. Instead, if a word gets trimmed, it could be shown with three dots. Here are some examples with words trimmed to a maximum of 5 characters:
Apocalypse now! -> Apoca... now!
Apocalypse! -> Apoca...
!Example! -> !Exam...
This is sweeeeeeeeeeeeeeeet! And something more. -> This is sweee... And somet... more.
Upvotes: 6
Views: 1178
Reputation: 460208
This is more efficient than a regex or LINQ approach. However, it only limits runs of a repeated character; it does not split by words or add "...". Whitespace (including line breaks and tab characters) should also be shortened, in my opinion.
public static string TrimLongWords(string original, int maxCount)
{
    if (null == original || original.Length <= maxCount) return original;
    StringBuilder builder = new StringBuilder(original.Length);
    int occurrence = 0;
    for (int i = 0; i < original.Length; i++)
    {
        char current = original[i];
        // Count how many times the current character has repeated in a row.
        if (i > 0 && current == original[i - 1])
            occurrence++;
        else
            occurrence = 1;
        // Keep at most maxCount consecutive copies of a character.
        if (occurrence <= maxCount)
            builder.Append(current);
    }
    return builder.ToString();
}
Upvotes: 2
Reputation: 8095
A more practical approach might be as @Curt suggested in the comments.
I can't immediately think of any English words which contain 3 identical letters in a row. Rather than simply cutting off a word after 6 characters, you might try this approach: whenever you encounter the same character twice in a row, remove any additional consecutive occurrences of it. Thus "sweeeeeet" becomes "sweet" and "tooooooo" becomes "too".
This would have the additional side-effect of limiting the number of identical punctuation or white space to 2, in case someone was overly zealous with those!!!!!!!!
If you wanted to account for ellipses (...) then just make the "maximum consecutive characters" count == 3, instead of 2.
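As a rough sketch of that idea (the class name RepeatLimiter, the method name LimitRuns and the maxRun parameter are made up for illustration, not taken from the comments), something along these lines should work:

```csharp
using System;
using System.Text;

public static class RepeatLimiter
{
    // Hypothetical helper: keeps at most maxRun consecutive copies of any
    // character (letters, punctuation and whitespace alike) and drops the rest.
    public static string LimitRuns(string input, int maxRun = 2)
    {
        // Nothing sensible to do for empty input or a non-positive limit.
        if (string.IsNullOrEmpty(input) || maxRun < 1) return input;

        var sb = new StringBuilder(input.Length);
        int run = 0;
        for (int i = 0; i < input.Length; i++)
        {
            // Extend the current run, or start a new one when the character changes.
            run = (i > 0 && input[i] == input[i - 1]) ? run + 1 : 1;
            if (run <= maxRun) sb.Append(input[i]);
        }
        return sb.ToString();
    }
}
```

With the default of 2, LimitRuns("sweeeeeet") gives "sweet" and LimitRuns("tooooooo") gives "too"; passing 3 as maxRun leaves an ellipsis ("...") intact.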
Upvotes: 2
Reputation: 1859
The following will limit the number of repeating characters to 6. So for your input "This is sweeeeeeeeeeeeeeeet! And something more." the output will be:
"This is sweeeeeet! And something more."
string s = "heloooooooooooooooooooooo worrrllllllllllllld!";
char[] chr = s.ToCharArray();
StringBuilder sb = new StringBuilder();
char currentchar = new char();
int charCount = 0;
foreach (char c in chr)
{
    if (c == currentchar)
    {
        charCount++;
    }
    else
    {
        charCount = 0;
    }
    if (charCount < 6)
    {
        sb.Append(c);
    }
    currentchar = c;
}
Console.WriteLine(sb.ToString());
//Output heloooooo worrrlllllld!
EDIT: Truncate words longer than 6 characters:
string s = "This is sweeeeeeeeeeeeeeeet! And something more.";
string[] words = s.Split(' ');
StringBuilder sb = new StringBuilder();
foreach (string word in words)
{
    char[] chars = word.ToCharArray();
    if (chars.Length > 6)
    {
        for (int i = 0; i < 6; i++)
        {
            sb.Append(chars[i]);
        }
        sb.Append("...").Append(" ");
    }
    else { sb.Append(word).Append(" "); }
}
sb.Remove(sb.Length - 1, 1);
Console.WriteLine(sb.ToString());
//Output: "This is sweeee... And someth... more."
Upvotes: 1
Reputation: 6132
I'd recommend using a StringBuilder together with loops:
public string TrimLongWords(string input, int maxWordLength)
{
    StringBuilder sb = new StringBuilder(input.Length);
    int currentWordLength = 0;
    bool stopTripleDot = false;
    foreach (char c in input)
    {
        bool isLetter = char.IsLetter(c);
        if (currentWordLength < maxWordLength || !isLetter)
        {
            sb.Append(c);
            stopTripleDot = false;
            if (isLetter)
                currentWordLength++;
            else
                currentWordLength = 0;
        }
        else if (!stopTripleDot)
        {
            sb.Append("...");
            stopTripleDot = true;
        }
    }
    return sb.ToString();
}
This would be faster than Regex or LINQ.
Expected results for maxWordLength == 6:
"UltraLongWord" -> "UltraL..."
"This-is-not-a-long-word" -> "This-is-not-a-long-word"
And the edge case maxWordLength == 0 would result in:
"Please don't trim me!!!" -> "... ...'... ... ...!!!" // poor, poor string...
Trimmed words now end with "...", as requested in the question. (I just realised that replacing the trimmed substrings with "..." has introduced quite a few bugs, and fixing them has rendered my code a bit bulky, sorry.)
Upvotes: 4
Reputation: 354694
EDIT: Since the requirements changed I'll stay in spirit with regular expressions:
Regex.Replace(original, string.Format(@"(\p{{L}}{{{0}}})\p{{L}}+", maxLength), "$1...");
Output with maxLength = 6:
Here's a text with tooooo... long words
This is sweeee...! And someth... more.
Old answer below, because I liked the approach, even though it's a little ... messy :-).
I hacked together a little regex replacement to do that. It's in PowerShell for now (for prototyping; I'll convert to C# afterwards):
'Here''s a text with tooooooooooooo long words','This is sweeeeeeeeeeeeeeeet! And something more.' |
  % {
    [Regex]::Replace($_, '(\w*?)(\w)\2{2,}(\w*)',
      {
        $m = $args[0]
        if ($m.Value.Length -gt 6) {
          $l = 6 - $m.Groups[1].Length - $m.Groups[3].Length
          $m.Groups[1].Value + $m.Groups[2].Value * $l + $m.Groups[3].Value
        } else {
          # Leave matches that already fit unchanged
          $m.Value
        }
      })
  }
Output is:
Here's a text with tooooo long words
This is sweeet! And something more.
What this does is find runs of characters (\w for now; should be changed to something sensible) that follow the pattern (something)(repeated character more than two times)(something else). For the replacement it uses a function that checks whether the match is longer than the desired maximum length, calculates how long the repeated part can be to still fit in the total length, and then cuts down only the repeated part to that length.
It's messy. It will fail to truncate words that are otherwise very long (e.g. »something« in the second test sentence) and the set of characters that constitute words needs to be changed as well. Consider this maybe a starting point if you want to go that route, but not a finished solution.
C# Code:
public static string TrimLongWords(this string original, int maxCount)
{
    return Regex.Replace(original, @"(\w*?)(\w)\2{2,}(\w*)",
        delegate(Match m) {
            // Groups[0] is the whole match; the captures start at index 1.
            var first = m.Groups[1].Value;
            var rep = m.Groups[2].Value;
            var last = m.Groups[3].Value;
            if (m.Value.Length > maxCount) {
                // Shorten only the repeated run; new string() throws on a negative count.
                var l = Math.Max(0, maxCount - first.Length - last.Length);
                return first + new string(rep[0], l) + last;
            }
            return m.Value;
        });
}
A nicer option for the character class would probably be something like \p{L}, depending on your needs.
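To illustrate the difference, here is a minimal sketch (the WordClassDemo class, the Mask helper and the sample string are made up for this example). In .NET, \w also matches digits and the underscore, while \p{L} matches letters only:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordClassDemo
{
    // Hypothetical helper: replaces every character matched by the given
    // character class with '#', so you can see what the class covers.
    public static string Mask(string input, string charClass)
    {
        return Regex.Replace(input, charClass, "#");
    }
}
```

For the input "na\u00EFve_user42!" (i.e. "naïve_user42!"), Mask with @"\w" yields "############!", whereas Mask with @"\p{L}" yields "#####_####42!", so the underscore and the digits would act as word boundaries with \p{L}.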
Upvotes: 4
Reputation: 101122
Using a simple Regex with a zero-width positive lookbehind assertion (LinqPad-ready example code):
void Main()
{
    foreach(var s in new [] { "Here's a text with tooooooooooooo long words",
                              "This is sweeeeeeeeeeeeeeeet! And something more.",
                              "Apocalypse now!",
                              "Apocalypse!",
                              "!Example!"})
        Regex.Replace(s, @"(?<=\w{5,})\S+", "...").Dump();
}
It matches the run of non-space characters that follows at least 5 word characters and replaces that run with "...".
Result:
Here's a text with toooo... long words
This is sweee... And somet... more.
Apoca... now!
Apoca...
!Examp...
Upvotes: 2
Reputation: 32561
Try this:
class Program
{
    static void Main(string[] args)
    {
        var stringWithLongWords = "Here's a text with tooooooooooooo long words";
        var trimmed = TrimLongWords(stringWithLongWords, 6);
    }
    private static string TrimLongWords(string stringWithLongWords, int p)
    {
        // Match only words longer than p characters and keep their first p characters.
        return Regex.Replace(stringWithLongWords, String.Format(@"\w{{{0},}}", p + 1), m =>
        {
            return m.Value.Substring(0, p) + "...";
        });
    }
}
Upvotes: 2
Reputation: 2004
You could use regex to find those repetitions:
string test = "This is sweeeeeeeeeeeeeeeet! And sooooooomething more.";
string result = Regex.Replace(test, @"(\w)\1+", delegate(Match match)
{
    string v = match.ToString();
    return v[0].ToString();
});
The result would be:
This is swet! And something more.
And maybe you could check the manipulated words with a spellchecker service: http://wiki.webspellchecker.net/doku.php?id=installationandconfiguration:web_service
Upvotes: 2
Reputation: 28127
Try this:
private static string TrimLongWords(string original, int maxCount)
{
    return string.Join(" ",
        original.Split(' ')
            .Select(x => {
                var r = Regex.Replace(x, @"\W", "");
                return r.Substring(0, r.Length > maxCount ? maxCount : r.Length) + Regex.Replace(x, @"\w", "");
            }));
}
Then TrimLongWords("This is sweeeeeeeeeeeeeeeet! And something more.", 5) becomes "This is sweee! And somet more."
Upvotes: 2