Reputation: 2781
I need to remove words from a string based on a set of words:
Words I want to remove:
DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND
If I receive a string like this (EDIT: the string has already been "cleaned" of any symbols):
THIS IS AN AMAZING WEBSITE AND LAYOUT
The result should be:
THIS IS AMAZING WEBSITE LAYOUT
So far I have:
public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
string pattern = "";
foreach (string word in splitWords)
{
pattern = @"\b" + word + "\b";
stringToClean = Regex.Replace(stringToClean, pattern, "");
}
return stringToClean;
}
But it's not removing the words. Any idea why?
I don't know if this is the most efficient way to do it. Maybe I should put the words in an array to avoid splitting them every time?
Thanks
Upvotes: 6
Views: 20089
Reputation: 35716
How about:
// make a pattern to match all words
var pattern = string.Format(
@"\b({0})\b",
string.Join("|", wordsToRemove.Split(new[] { ' ' })));
// pattern will be of the form "\b(badword1|badword2|...)\b"
// remove all the bad words from the string in one go.
var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);
// normalise the white space in the string (one space at a time)
var normalisedString = Regex.Replace(cleanString, @"\s+", " ");
The first line makes a pattern that matches any of the words to remove. The second line replaces them all at once which saves needless iteration. The third line normalises the white space in the string.
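For anyone who wants to paste this in, here is a minimal sketch of the same idea wrapped in a self-contained method. The class name `WordCleaner` is made up for illustration. `Regex.Escape` and the final `Trim()` are additions that are not in the snippet above: they assume the words may contain regex metacharacters and that a removed word may sit at either end of the string.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCleaner // hypothetical name, for illustration only
{
    public static string StringWordsRemove(string stringToClean, string wordsToRemove)
    {
        // Escape each word so characters such as '.' or '+' are matched literally
        // (assumption: the words are plain text, not regex fragments).
        var alternatives = wordsToRemove
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(word => Regex.Escape(word));

        // The pattern has the form \b(?:word1|word2|...)\b
        string pattern = @"\b(?:" + string.Join("|", alternatives) + @")\b";

        // Remove every listed word in a single pass.
        string cleaned = Regex.Replace(stringToClean, pattern, string.Empty);

        // Collapse runs of white space and trim both ends.
        return Regex.Replace(cleaned, @"\s+", " ").Trim();
    }
}
```

Called with the string and word list from the question, this should return `THIS IS AMAZING WEBSITE LAYOUT`.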
Upvotes: 0
Reputation: 3558
private static List<string> wordsToRemove =
"DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ').ToList();
public static string StringWordsRemove(string stringToClean)
{
return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove));
}
Modification to handle punctuation:
public static string StringWordsRemove(string stringToClean)
{
// Define how to tokenize the input string, i.e. space only or punctuations also
return string.Join(" ", stringToClean
.Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
.Except(wordsToRemove));
}
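One thing to be aware of: `Enumerable.Except` is a set operation, so besides dropping the listed words it also drops repeated words from the input (a second `AMAZING` would disappear, for example). If duplicates need to survive, something along these lines might be closer to what is wanted. The class name `WordFilter` and the choice of a `HashSet<string>` combined with `Where` are only for illustration and are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilter // hypothetical name, for illustration only
{
    // A HashSet gives fast lookups; matching is case-sensitive, as in the original.
    private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' '));

    public static string StringWordsRemove(string stringToClean)
    {
        // Where keeps every word that is not in the set, including repeats,
        // whereas Except would return each remaining word only once.
        return string.Join(" ", stringToClean
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !WordsToRemove.Contains(word)));
    }
}
```

With this version, `THIS IS AN AMAZING WEBSITE AND AMAZING LAYOUT` keeps both occurrences of `AMAZING`.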
Upvotes: 9
Reputation: 840
Or...
stringToClean = Regex.Replace(stringToClean, @"\bDE\b|\bDA\b|\bDAS\b|\bDO\b|\bDOS\b|\bAN\b|\bNAS\b|\bNO\b|\bNOS\b|\bEM\b|\bE\b|\bA\b|\bAS\b|\bO\b|\bOS\b|\bAO\b|\bAOS\b|\bP\b|\bLDA\b|\bAND\b", String.Empty);
// collapse the gaps left behind by removed words into single spaces
stringToClean = Regex.Replace(stringToClean, @"\s+", " ").Trim();
Upvotes: 0
Reputation: 4352
The output you get is "THIS IS AMAZING WEBSITE LAYOUT".
I was running into an issue where a stray "D" was left in the result (THIS IS AN AMAZING WEBSITE D LAYOUT), because a plain replace removes only part of a word. This version removes a word only when the whole word is one of those you defined (I imagine this is what you want?).
string[] tabooWords = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ');
string text = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string result = text;
foreach (string word in text.Split(' '))
{
if (tabooWords.Contains(word.ToUpper()))
{
int start = result.IndexOf(word);
result = result.Remove(start, word.Length);
}
}
Upvotes: 0
Reputation: 4607
I used LINQ:
string exceptions = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";
string[] exceptionsList = exceptions.Split(' ');
string test = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string[] wordList = test.Split(' ');
string final = null;
var result = wordList.Except(exceptionsList).ToArray();
final = String.Join(" ",result);
Console.WriteLine(final);
Upvotes: 1
Reputation: 12439
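In a regular string literal, "\b" is the backspace character rather than the regex word boundary, so the closing "\b" needs to be a verbatim string as well. I just changed this line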
pattern = @"\b" + word + "\b";
to this
pattern = @"\b" + word + @"\b"; //added '@'
and I got the result
THIS IS AMAZING WEBSITE LAYOUT
It would also be better to use String.Empty instead of "", like:
stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
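A small sketch that makes the difference between the two literals visible and runs the question's loop with only that fix applied. The class name `BoundaryDemo` and the two constants are hypothetical, added purely for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo // hypothetical name, for illustration only
{
    // Regular literal: the compiler turns \b into one backspace character (U+0008).
    public const string Regular = "\b";

    // Verbatim literal: the backslash and the 'b' both reach the regex engine.
    public const string Verbatim = @"\b";

    // The question's method with only the verbatim-literal fix applied.
    public static string StringWordsRemove(string stringToClean, string wordsToRemove)
    {
        foreach (string word in wordsToRemove.Split(' '))
        {
            string pattern = @"\b" + word + @"\b";
            stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
        }
        return stringToClean;
    }
}
```

Here `BoundaryDemo.Regular.Length` is 1 and `BoundaryDemo.Verbatim.Length` is 2. Note that the loop leaves the spaces that surrounded each removed word in place, so the result contains double spaces unless the white space is normalised afterwards.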
Upvotes: 1
Reputation: 31
public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
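// Match any listed word that has a space on both sides.
// Note: this misses a word at the very start or end of the string,
// and the second of two adjacent listed words (the shared space is already consumed).
string pattern = " (" + string.Join("|", splitWords) + ") ";
string cleaned = Regex.Replace(stringToClean, pattern, " ");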
return cleaned;
}
Upvotes: 0