Aabela

Reputation: 1418

Removing words with special characters in them

I have a long string composed of a number of different words.

I want to go through all of them, and if a word contains a special character or number (except '-'), or starts with a capital letter, I want to delete it (the whole word, not just that character). For all intents and purposes, 'foreign' letters can count as special characters.

The obvious solution is to loop through each word (after splitting the string) and then through each character, but I'm hoping there's a faster way of doing it. Perhaps using regex, but I have almost no experience with it.

Thanks

ADDED:

(What I want for example:)

Input: "this Is an Example of 5 words in an input like-so from example.com"

Output: {this,an,of,words,in,an,input,like-so,from}

(What I've tried so far)

List<string> response = new List<string>();

string[] splitString = text.Split(' ');

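foreach (string s in splitString)
{
    bool add = true;
    foreach (char c in s)
    {
        // Reject the word on the first character that is neither '-' nor a lowercase letter.
        if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
        {
            add = false;
            break;
        }
    }
    // Decide once per word, after every character has been checked.
    if (add)
    {
        response.Add(s);
    }
}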

Edit 2:

For me, a word should be a run of characters (a..z) separated by spaces. A trailing , . ! or ... shouldn't count toward the 'special character' condition (which is really mostly just there to remove URLs and the like).

So: "I saw a dog. It was black!" should result in {saw,a,dog,was,black}

Upvotes: 1

Views: 4284

Answers (7)

Mark M

Reputation: 976

How about this?

(?<=^|\s+)(?<word>[a-z-]+)(?=$|\s+)

Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))

Rules:

  1. Word can only be preceded by start of line or some number of whitespace characters
  2. Word can only be followed by end of line or some number of whitespace characters (Edit supports words ending with periods, commas, exclamation points, and ellipses)
  3. Word can only contain lowercase (Latin) letters and dashes

The named group containing each word is "word"
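For what it's worth, here is a minimal sketch of how the edited pattern and its "word" group might be used from C#. The class name NamedGroupDemo and the helper ExtractWords are made up for illustration; only Regex, Match and the LINQ calls come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The edited pattern from this answer; the "word" group holds each accepted word.
    private static readonly Regex WordPattern = new Regex(
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))");

    // ExtractWords is a hypothetical helper name, not a library method.
    public static List<string> ExtractWords(string text)
    {
        return WordPattern.Matches(text)
            .Cast<Match>()                         // MatchCollection is non-generic on older frameworks
            .Select(m => m.Groups["word"].Value)   // read the named group rather than the whole match
            .ToList();
    }

    public static void Main()
    {
        var words = ExtractWords("I saw a dog. It was black!");
        Console.WriteLine(string.Join(",", words));
        // prints: saw,a,dog,was,black
    }
}
```

Because the trailing punctuation sits inside the lookahead, it is checked but never becomes part of the captured word.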

Upvotes: 1

Qtax

Reputation: 33908

So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?

A regex like this will find such words:

(?<!\S)[a-z-]+(?!\S)

To also allow for words that end with single punctuation, you could use:

(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))

Example (ideone):

var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";

var m = Regex.Matches(str, re);

Console.WriteLine("Matched: ");
foreach (Match i in m)
    Console.Write(i + " ");

Notice the punctuation in the string.

Output:

Matched: 
this an of words in an input like-so from foo bar 

Upvotes: 2

Kara Potts

Reputation: 1062

You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:

(?<=\s|^)[a-z-]+(?=\s|$)

The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.

All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
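As a rough sketch of that last step, assuming the input from the question: the class name LookaroundDemo and the helper FindWords are hypothetical names chosen for this example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // FindWords is a hypothetical helper; it wraps Regex.Matches with the pattern above.
    public static string[] FindWords(string input)
    {
        const string regexString = @"(?<=\s|^)[a-z-]+(?=\s|$)";
        return Regex.Matches(input, regexString)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        string input = "this Is an Example of 5 words in an input like-so from example.com";
        Console.WriteLine(string.Join(",", FindWords(input)));
        // prints: this,an,of,words,in,an,input,like-so,from
    }
}
```

Note that this pattern requires whitespace or the end of the string directly after the word, so a word followed by punctuation (such as "dog.") is not matched.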

Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet

Upvotes: 0

João Angelo

Reputation: 57688

You can do this in two ways, the white-list way and the black-list way. With a white-list you define the set of characters that you consider acceptable, and with a black-list you define the set that you reject.

Let's assume the white-list way and that you accept only the characters a-z, A-Z and the - character. Additionally, you have the rule that the first character of a word cannot be an upper case character.

With this you can do something like this:

string target = "This is a white-list example: (Foo, bar1)";

var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");

string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();

Console.WriteLine(string.Join(", ", words));

Outputs:

// is, a, white-list, example
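For comparison, a black-list version could look roughly like the sketch below. The rejected set (a leading upper case letter, digits, and the characters . : / @ _) and the list of trimmed punctuation are assumptions picked for this example, and BlackListDemo and KeepWords are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class BlackListDemo
{
    // Assumed black-list: a leading upper case letter, or any digit or one of . : / @ _
    private static readonly Regex Rejected = new Regex(@"^[A-Z]|[\d.:/@_]");

    // KeepWords is a hypothetical helper name.
    public static string[] KeepWords(string text)
    {
        return text
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Strip surrounding punctuation before testing the token itself.
            .Select(token => token.Trim(',', '.', '!', '?', ':', ';', '(', ')'))
            .Where(token => token.Length > 0 && !Rejected.IsMatch(token))
            .ToArray();
    }

    public static void Main()
    {
        string target = "This is a white-list example: (Foo, bar1)";
        Console.WriteLine(string.Join(", ", KeepWords(target)));
        // prints: is, a, white-list, example
    }
}
```

The trade-off is that anything you forget to list slips through (foreign letters, for instance), which is why the white-list is probably the safer choice here.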

Upvotes: 0

Siddharth

Reputation: 102

This could be a starting point. Right now it checks only for "." as a special character. Apart from the extra spaces left behind where words were removed, this outputs: "this an of words in an like-so from"

string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";

System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");

Upvotes: 0

jrb

Reputation: 1728

List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};

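// Iterate backwards so removals do not shift the indices still to be visited.
for (int i = strings.Count - 1; i >= 0; i--)
{
   if (strings[i].Contains("-"))
   {
       // RemoveAt deletes this exact element; Remove(value) would delete the first equal one.
       strings.RemoveAt(i);
   }
}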

Upvotes: 0

learner

Reputation: 1975

Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.

Upvotes: 0
