Mikael Koskinen

Reputation: 12916

Trim too long words from sentences in C#?

I have C# strings which contain sentences. Sometimes these sentences are OK, sometimes they are just user generated random characters. What I would like to do is to trim words inside these sentences. For example given the following string:

var stringWithLongWords = "Here's a text with tooooooooooooo long words";

I would like to run this through a filter:

var trimmed = TrimLongWords(stringWithLongWords, 6);

And to get an output where every word can contain only up to 6 characters:

"Here's a text with tooooo long words"

Any ideas how this could be done with good performance? Is there anything in .NET which could handle this automatically?

I'm currently using the following code:

    private static string TrimLongWords(string original, int maxCount)
    {
        return string.Join(" ", original.Split(' ').Select(x => x.Substring(0, x.Length > maxCount ? maxCount : x.Length)));
    }

Which in theory works, but it provides a bad output if the long word ends with a separator other than space. For example:

This is sweeeeeeeeeeeeeeeet! And something more.

Ends up looking like this:

This is sweeeeeeee And something more.

Update:

OK, the comments were so good that I realized this may have too many "what ifs". Perhaps it would be better to forget about the separators. Instead, if a word gets trimmed, it could be shown with three dots. Here are some examples with words trimmed to max 5 characters:

Apocalypse now! -> Apoca... now!

Apocalypse! -> Apoca...

!Example! -> !Exam...

This is sweeeeeeeeeeeeeeeet! And something more. -> This is sweee... And somet... more.

Upvotes: 6

Views: 1178

Answers (9)

Tim Schmelter

Reputation: 460208

This is more efficient than a regex or Linq approach. However, it does not split by words or add "..."; it only limits runs of identical characters to maxCount. White-space (incl. line breaks or tab characters) should also be shortened imho.

public static string TrimLongWords(string original, int maxCount)
{
    if (null == original || original.Length <= maxCount) return original;

    StringBuilder builder = new StringBuilder(original.Length);
    int occurrence = 0; // length of the current run of identical characters

    for (int i = 0; i < original.Length; i++)
    {
        Char current = original[i];
        // Index the string directly; the LINQ ElementAtOrDefault call may walk the string from the start each time.
        if (i > 0 && current == original[i - 1])
            occurrence++;
        else
            occurrence = 1;
        if (occurrence <= maxCount)
            builder.Append(current);
    }
    return builder.ToString();
}

Upvotes: 2

Breealzibub

Reputation: 8095

A more practical approach might be as @Curt suggested in the comments.

I can't immediately think of any English words which contain 3 identical letters in a row. Rather than simply cutting off a word after 6 characters, you might try this approach: whenever you encounter the same character twice in a row, remove any additional consecutive occurrences of it. Thus "sweeeeeet" becomes "sweet" and "tooooooo" becomes "too".

This would have the additional side-effect of limiting the number of identical punctuation or white space to 2, in case someone was overly zealous with those!!!!!!!!

If you wanted to account for ellipses (...) then just make the "maximum consecutive characters" count == 3, instead of 2.
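The idea above could be sketched roughly as follows. This is only a minimal illustration, not tested against real user input; the class name `RepeatLimiter`, the method name `LimitRepeats` and the `maxRun` parameter are made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatLimiter
{
    // Hypothetical helper: shortens any run of one repeated character
    // that is longer than maxRun down to exactly maxRun characters.
    public static string LimitRepeats(string input, int maxRun = 2)
    {
        if (string.IsNullOrEmpty(input) || maxRun < 1) return input;

        // (.) captures one character; \1{maxRun,} requires at least maxRun
        // further copies of it, so only runs longer than maxRun match.
        string pattern = @"(.)\1{" + maxRun + ",}";

        return Regex.Replace(
            input,
            pattern,
            m => new string(m.Value[0], maxRun),
            RegexOptions.Singleline);
    }
}
```

With the default `maxRun` of 2, `LimitRepeats("sweeeeeet")` gives "sweet" and `LimitRepeats("tooooooo")` gives "too". Calling it with `maxRun` set to 3 leaves an ellipsis such as "Wait..." untouched.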

Upvotes: 2

Riv

Reputation: 1859

The following will limit the number of repeating characters to 6. So for your input "This is sweeeeeeeeeeeeeeeet! And something more." the output will be:

"This is sweeeeeet! And something more."

string s = "heloooooooooooooooooooooo worrrllllllllllllld!";
char[] chr = s.ToCharArray();
StringBuilder sb = new StringBuilder();
char currentchar = new char();
int charCount = 0;

foreach (char c in chr)
{
    if (c == currentchar)
    {
        charCount++;
    }
    else
    {
        charCount = 0;
    }

    if (charCount < 6)
    {
        sb.Append(c);
    }

    currentchar = c;
}

Console.WriteLine(sb.ToString());
//Output: heloooooo worrrlllllld!

EDIT: Truncate words longer than 6 characters:

string s = "This is sweeeeeeeeeeeeeeeet! And something more.";
string[] words = s.Split(' ');
StringBuilder sb = new StringBuilder();

foreach (string word in words)
{
    char[] chars = word.ToCharArray();
    if (chars.Length > 6)
    {
        for (int i = 0; i < 6; i++)
        {
            sb.Append(chars[i]);
        }
        sb.Append("...").Append(" ");
    }
    else { sb.Append(word).Append(" "); }
}

sb.Remove(sb.Length - 1, 1);
Console.WriteLine(sb.ToString());
//Output: "This is sweeee... And someth... more."

Upvotes: 1

Nolonar

Reputation: 6132

I'd recommend using a StringBuilder together with loops:

public string TrimLongWords(string input, int maxWordLength)
{
    StringBuilder sb = new StringBuilder(input.Length);
    int currentWordLength = 0;
    bool stopTripleDot = false;
    foreach (char c in input)
    {
        bool isLetter = char.IsLetter(c);
        if (currentWordLength < maxWordLength || !isLetter)
        {
            sb.Append(c);
            stopTripleDot = false;
            if (isLetter)
                currentWordLength++;
            else
                currentWordLength = 0;
        }
        else if (!stopTripleDot)
        {
            sb.Append("...");
            stopTripleDot = true;
        }
    }
    return sb.ToString();
}

This would be faster than Regex or Linq.
Expected results for maxWordLength == 6:

"UltraLongWord"           -> "UltraL..."
"This-is-not-a-long-word" -> "This-is-not-a-long-word"

And the edge-case maxWordLength == 0 would result in:

"Please don't trim me!!!" -> "... ...'... ... ...!!!" // poor, poor string...

[This answer has been updated to accommodate the "..." as requested in the question]

(I just realised that replacing the trimmed substrings with "..." has introduced quite a few bugs, and fixing them has rendered my code a bit bulky, sorry)

Upvotes: 4

Joey

Reputation: 354694

EDIT: Since the requirements changed I'll stay in spirit with regular expressions:

Regex.Replace(original, string.Format(@"(\p{{L}}{{{0}}})\p{{L}}+", maxLength), "$1...");

Output with maxLength = 6:

Here's a text with tooooo... long words
This is sweeee...! And someth... more.

Old answer below, because I liked the approach, even though it's a little ... messy :-).


I hacked together a little regex replacement to do that. It's in PowerShell for now (for prototyping; I'll convert to C# afterwards):

'Here''s a text with tooooooooooooo long words','This is sweeeeeeeeeeeeeeeet! And something more.' |
  % {
    [Regex]::Replace($_, '(\w*?)(\w)\2{2,}(\w*)',
      {
        $m = $args[0]
        if ($m.Value.Length -gt 6) {
          $l = 6 - $m.Groups[1].Length - $m.Groups[3].Length
          $m.Groups[1].Value + $m.Groups[2].Value * $l + $m.Groups[3].Value
        } else {
          # Return short matches unchanged; otherwise they are replaced with nothing.
          $m.Value
        }
      })
  }

Output is:

Here's a text with tooooo long words
This is sweeet! And something more.

What this does is find runs of characters (\w for now; should be changed to something sensible) that follow the pattern (something)(repeated character more than two times)(something else). For the replacement it uses a function that checks whether the match is longer than the desired maximum length. If so, it calculates how long the repeated part can be to still fit in the total length, and then cuts down only the repeated part to that length.

It's messy. It will fail to truncate words that are otherwise very long (e.g. »something« in the second test sentence) and the set of characters that constitute words needs to be changed as well. Consider this maybe a starting point if you want to go that route, but not a finished solution.

C# Code:

public static string TrimLongWords(this string original, int maxCount)
{
    return Regex.Replace(original, @"(\w*?)(\w)\2{2,}(\w*)",
        delegate(Match m) {
            // Groups[0] is the whole match; the captures start at index 1.
            var first = m.Groups[1].Value;
            var rep = m.Groups[2].Value;
            var last = m.Groups[3].Value;
            if (m.Value.Length > maxCount) {
                // Keep at least one repeated character so the count never goes negative.
                var l = Math.Max(1, maxCount - first.Length - last.Length);
                return first + new string(rep[0], l) + last;
            }
            return m.Value;
        });
}

A nicer option for the character class would probably be something like \p{L}, depending on your needs.
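To illustrate the difference, here is a small sketch that applies the truncating regex from the top of this answer with an interchangeable character class. The class name `CharClassDemo` and the method `Truncate` are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharClassDemo
{
    // Hypothetical helper: truncates runs of characters matched by charClass
    // to maxLength characters and appends "...".
    public static string Truncate(string input, string charClass, int maxLength)
    {
        // Builds e.g. (\w{6})\w+ or (\p{L}{6})\p{L}+
        string pattern = "(" + charClass + "{" + maxLength + "})" + charClass + "+";
        return Regex.Replace(input, pattern, "$1...");
    }
}
```

For the input "order_12345678 sweeeeeeet" and a maximum of 6, `Truncate(s, @"\w", 6)` returns "order_... sweeee..." because \w also matches digits and underscores, while `Truncate(s, @"\p{L}", 6)` returns "order_12345678 sweeee..." because only letters count towards the word length.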

Upvotes: 4

sloth

Reputation: 101122

Using a simple Regex with a zero-width positive lookbehind assertion (LinqPad-ready example code):

void Main()
{
    foreach(var s in new [] { "Here's a text with tooooooooooooo long words", 
                              "This is sweeeeeeeeeeeeeeeet! And something more.",
                              "Apocalypse now!",
                              "Apocalypse!",
                              "!Example!"})
        Regex.Replace(s, @"(?<=\w{5,})\S+", "...").Dump();

}

It looks for any run of non-space characters that follows at least 5 word characters and replaces the match with "...".

Result:

Here's a text with toooo... long words
This is sweee... And somet... more.
Apoca... now!
Apoca...
!Examp...

Upvotes: 2

Alex Filipovici

Reputation: 32561

Try this:

class Program
{
    static void Main(string[] args)
    {
        var stringWithLongWords = "Here's a text with tooooooooooooo long words";
        var trimmed = TrimLongWords(stringWithLongWords, 6);
    }

    private static string TrimLongWords(string stringWithLongWords, int p)
    {
        // Match only words longer than p characters and keep their first p characters.
        return Regex.Replace(stringWithLongWords, String.Format(@"[\w]{{{0},}}", p + 1), m =>
        {
            return m.Value.Substring(0, p) + "...";
        });
    }
}

Upvotes: 2

cansik

Reputation: 2004

You could use regex to find those repetitions:


string test = "This is sweeeeeeeeeeeeeeeet! And sooooooomething more.";
string result = Regex.Replace(test, @"(\w)\1+", delegate(Match match)
{
    string v = match.ToString();
    return v[0].ToString();
});

The result would be:


This is swet! And something more.

And maybe you could check the manipulated words with a spellchecker service: http://wiki.webspellchecker.net/doku.php?id=installationandconfiguration:web_service

Upvotes: 2

dav_i

Reputation: 28127

Try this:

private static string TrimLongWords(string original, int maxCount)
{
   return string.Join(" ", 
   original.Split(' ')
   .Select(x => { 
     var r = Regex.Replace(x, @"\W", ""); 
     return r.Substring(0, r.Length > maxCount ? maxCount : r.Length) + Regex.Replace(x, @"\w", ""); 
   }));
}

Then TrimLongWords("This is sweeeeeeeeeeeeeeeet! And something more.", 5) becomes "This is sweee! And somet more."

Upvotes: 2
