Reputation: 1418
I have a long string composed of a number of different words.
I want to go through all of them, and if a word contains a special character or number (except '-'), or starts with a capital letter, I want to delete it (the whole word, not just that character). For all intents and purposes, 'foreign' letters can count as special characters.
The obvious solution is to run a loop through each word (after splitting it) and then a loop through each character - but I'm hoping there's a faster way of doing it? Perhaps using Regex but I've almost no experience with it.
Thanks
ADDED:
(What I want for example:)
Input: "this Is an Example of 5 words in an input like-so from example.com"
Output: {this,an,of,words,in,an,input,like-so,from}
(What I've tried so far)
List<string> response = new List<string>();
string[] splitString = text.Split(' ');
foreach (string s in splitString)
{
    bool add = true;
    foreach (char c in s.ToCharArray())
    {
        // Reject the word as soon as a character is neither '-' nor a lowercase letter.
        if (!(c.Equals('-') || (Char.IsLetter(c) && Char.IsLower(c))))
        {
            add = false;
            break;
        }
    }
    // Add the word only after every character has been checked.
    if (add)
    {
        response.Add(s);
    }
}
Edit 2:
For me, a word should be a number of characters (a..z) separated by a space. Trailing punctuation such as , . ! or ... at the end shouldn't count for the 'special character' condition (which is really mostly just there to remove URLs or the like).
So: "I saw a dog. It was black!" should result in {saw,a,dog,was,black}
Upvotes: 1
Views: 4284
Reputation: 976
How about this?
(?<=^|\s+)(?<word>[a-z-]+)(?=$|\s+)
Edit: Meant (?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))
Rules:
The named group containing each word is "word"
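As a rough sketch of how the named group might be read back in C#, the snippet below runs the edited pattern over the asker's second example. The class and method names (NamedGroupDemo, ExtractWords) are made up for illustration and are not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // The edited pattern from the answer above, unchanged.
    private const string Pattern =
        @"(?<=^|\s+)(?<word>[a-z\-]+)(?=(?:\.|,|!|\.\.\.)?(?:$|\s+))";

    // Hypothetical helper: collects the "word" group of every match.
    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text, Pattern))
        {
            words.Add(m.Groups["word"].Value);
        }
        return words;
    }

    public static void Main()
    {
        var words = ExtractWords("I saw a dog. It was black!");
        Console.WriteLine(string.Join(",", words)); // prints: saw,a,dog,was,black
    }
}
```

The trailing punctuation is only checked by the lookahead, so it never becomes part of the captured word.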
Upvotes: 1
Reputation: 33908
So you want to find all "words" that only contain characters a-z or -, for words that are separated by spaces?
A regex like this will find such words:
(?<!\S)[a-z-]+(?!\S)
To also allow for words that end with single punctuation, you could use:
(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))
var re = @"(?<!\S)[a-z-]+(?=[,.!?:;]?(?!\S))";
var str = "this, Is an! Example of 5 words in an input like-so from example.com foo: bar?";
var m = Regex.Matches(str, re);
Console.WriteLine("Matched: ");
foreach (Match i in m)
    Console.Write(i + " ");
Notice the punctuation in the string.
Output:
Matched:
this an of words in an input like-so from foo bar
Upvotes: 2
Reputation: 1062
You can use look-aheads and look-behinds to do this. Here's a regex that matches your example:
(?<=\s|^)[a-z-]+(?=\s|$)
The explanation is: match one or more alphabetic characters (lowercase only, plus hyphen), as long as what comes before the characters is whitespace (or the start of the string), and as long as what comes after is whitespace or the end of the string.
All you need to do now is plug that into System.Text.RegularExpressions.Regex.Matches(input, regexString) to get your list of words.
Reference: http://www.mikesdotnetting.com/Article/46/CSharp-Regular-Expressions-Cheat-Sheet
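A minimal sketch of that last step, assuming the regex above is used as-is. The wrapper names (LookaroundDemo, LowercaseWords) are invented for the example.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Hypothetical wrapper around Regex.Matches using the pattern from this answer.
    public static string[] LowercaseWords(string input)
    {
        const string regexString = @"(?<=\s|^)[a-z-]+(?=\s|$)";
        return Regex.Matches(input, regexString)
                    .Cast<Match>()          // MatchCollection is non-generic in older frameworks
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "this Is an Example of 5 words in an input like-so from example.com";
        Console.WriteLine(string.Join(",", LowercaseWords(input)));
        // prints: this,an,of,words,in,an,input,like-so,from
    }
}
```

Note that this pattern requires whitespace or the end of the string directly after the word, so words followed by punctuation are skipped: for "I saw a dog. It was black!" it yields only saw, a and was.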
Upvotes: 0
Reputation: 57688
You can do this in two ways, the white-list way and the black-list way. With a white-list you define the set of characters that you consider to be acceptable, and with the black-list it's the opposite.
Let's assume the white-list way and that you accept only characters a-z, A-Z and the - character. Additionally, you have the rule that the first character of a word cannot be an upper case character.
With this you can do something like this:
string target = "This is a white-list example: (Foo, bar1)";
var matches = Regex.Matches(target, @"(?:\b)(?<Word>[a-z]{1}[a-zA-Z\-]*)(?:\b)");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
Console.WriteLine(string.Join(", ", words));
Outputs:
// is, a, white-list, example
Upvotes: 0
Reputation: 102
This could be a starting point. Right now it only checks for "." as a special character. Apart from the extra spaces left behind where words were removed, this outputs: "this an of words in an like-so from"
string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
string line = "this Is an Example of 5 words in an in3put like-so from example.com";
System.Text.RegularExpressions.Regex r = new System.Text.RegularExpressions.Regex(pattern);
line = r.Replace(line,"");
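Since Replace leaves a gap wherever a word was removed, one possible follow-up is to split the result and discard the empty entries. This is only a sketch built on the pattern above; BlacklistDemo and RemainingWords are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BlacklistDemo
{
    // Hypothetical helper: strips unwanted words with the answer's pattern,
    // then splits what is left into individual words.
    public static string[] RemainingWords(string line)
    {
        const string pattern = @"[A-Z]\w+|\w*[0-9]+\w*|\w*[\.]+\w*";
        string cleaned = Regex.Replace(line, pattern, "");
        // RemoveEmptyEntries drops the empty strings caused by leftover double spaces.
        return cleaned.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string line = "this Is an Example of 5 words in an in3put like-so from example.com";
        Console.WriteLine(string.Join(" ", RemainingWords(line)));
        // prints: this an of words in an like-so from
    }
}
```

One limitation to be aware of: `[A-Z]\w+` needs at least two characters, so a single capital letter such as "I" is not removed by this pattern.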
Upvotes: 0
Reputation: 1728
List<string> strings = new List<string>() {"asdf", "sdf-sd", "sdfsdf"};
// Iterate backwards (down to index 0) so removals do not shift the items still to be visited.
for (int i = strings.Count - 1; i >= 0; i--)
{
    if (strings[i].Contains("-"))
    {
        strings.RemoveAt(i);
    }
}
Upvotes: 0
Reputation: 1975
Have a look at Microsoft's How to: Search Strings Using Regular Expressions (C# Programming Guide) - it's about regexes in C#.
Upvotes: 0