Patrick

Reputation: 2781

Remove words from a string based on a list of words in C#

I need to remove words from a string based on a set of words:

Words I want to remove:

DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND

If I receive a string like this (EDIT: the string has already been "cleaned" of any symbols):

THIS IS AN AMAZING WEBSITE AND LAYOUT

The result should be:

THIS IS AMAZING WEBSITE LAYOUT

So far I have:

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });

    string pattern = "";

    foreach (string word in splitWords)
    {
        pattern = @"\b" + word + "\b";
        stringToClean = Regex.Replace(stringToClean, pattern, "");
    }

    return stringToClean;
}

But it's not removing the words. Any idea why?

I also don't know if this is the most efficient way to do it. Maybe I should put the words in an array to avoid splitting them on every call?

Thanks

Upvotes: 6

Views: 20089

Answers (7)

Jodrell

Reputation: 35716

How about:

// make a pattern to match all words 
var pattern = string.Format(
    @"\b({0})\b",
    string.Join("|", wordsToRemove.Split(new[] { ' ' })));

// pattern will be of the form "\b(badword1|badword2|...)\b"

// remove all the bad words from the string in one go.    
var cleanString = Regex.Replace(stringToClean, pattern, string.Empty);

// normalise the white space in the string (one space at a time)
var normalisedString = Regex.Replace(cleanString, @"\s+", " ");

The first line makes a pattern that matches any of the words to remove. The second line replaces them all at once which saves needless iteration. The third line normalises the white space in the string.
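As a minimal sketch, the three steps above could be wrapped in the method from the question. Two things here are additions rather than part of the answer: each word is passed through `Regex.Escape` on the assumption that the word list might one day contain regex metacharacters, and the result is trimmed so that a removed first or last word does not leave a stray space. The class name `WordCleaner` is hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordCleaner // hypothetical class name
{
    public static string StringWordsRemove(string stringToClean, string wordsToRemove)
    {
        // Escape each word so regex metacharacters are matched literally (assumption:
        // the word list may not always be plain letters).
        var alternatives = wordsToRemove
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(Regex.Escape);

        // Pattern has the form \b(word1|word2|...)\b
        var pattern = @"\b(" + string.Join("|", alternatives) + @")\b";

        // Remove every listed word in a single pass.
        var cleaned = Regex.Replace(stringToClean, pattern, string.Empty);

        // Collapse runs of white space, then drop any left at either end.
        return Regex.Replace(cleaned, @"\s+", " ").Trim();
    }
}
```

With the question's word list and input, this should return THIS IS AMAZING WEBSITE LAYOUT with single spaces between the words.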

Upvotes: 0

Fung

Reputation: 3558

private static List<string> wordsToRemove =
    "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ').ToList();

public static string StringWordsRemove(string stringToClean)
{
    return string.Join(" ", stringToClean.Split(' ').Except(wordsToRemove));
}

Modification to handle punctuation:

public static string StringWordsRemove(string stringToClean)
{
    // Define how to tokenize the input string, i.e. space only or punctuations also
    return string.Join(" ", stringToClean
        .Split(new[] { ' ', ',', '.', '?', '!' }, StringSplitOptions.RemoveEmptyEntries)
        .Except(wordsToRemove));
}
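One caveat worth knowing: `Except` is a set operation, so besides dropping the listed words it also returns each remaining word only once. An input such as A ROSE IS A ROSE would come back as ROSE IS. If repeated words must be kept, a filter with `Where` is one alternative. The sketch below is not from the answer; the class name `WordFilter` is hypothetical, and the case-insensitive comparer is an assumption that can be swapped for `StringComparer.Ordinal`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordFilter // hypothetical class name
{
    // HashSet gives fast lookups; OrdinalIgnoreCase is an assumption, not a requirement.
    private static readonly HashSet<string> WordsToRemove = new HashSet<string>(
        "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' '),
        StringComparer.OrdinalIgnoreCase);

    public static string StringWordsRemove(string stringToClean)
    {
        // Where keeps every word that is not in the list, including repeats.
        var kept = stringToClean
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(word => !WordsToRemove.Contains(word));

        return string.Join(" ", kept);
    }
}
```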

Upvotes: 9

James R.

Reputation: 840

Or...

stringToClean = Regex.Replace(stringToClean, @"\bDE\b|\bDA\b|\bDAS\b|\bDO\b|\bDOS\b|\bAN\b|\bNAS\b|\bNO\b|\bNOS\b|\bEM\b|\bE\b|\bA\b|\bAS\b|\bO\b|\bOS\b|\bAO\b|\bAOS\b|\bP\b|\bLDA\b|\bAND\b", String.Empty);
stringToClean = Regex.Replace(stringToClean, @" {2,}", " "); // collapse the doubled spaces left behind into one

Upvotes: 0

Dr Schizo

Reputation: 4352

The output you get is "THIS IS AMAZING WEBSITE LAYOUT" (with a doubled space left where each word was removed).

With a plain Replace I ran into an issue where a stray "D" was left in the result (so it was THIS IS AN AMAZING WEBSITE D LAYOUT), because Replace also replaces part of a longer word. The code below removes a word only when the whole word is in the list (I imagine this is what you want?).

string[] tabooWords = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND".Split(' ');
string text = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string result = text;

foreach (string word in text.Split(' '))
{
    if (tabooWords.Contains(word.ToUpper()))
    {
        int start = result.IndexOf(word);
        result = result.Remove(start, word.Length);
    }
}

Upvotes: 0

Lotok

Reputation: 4607

I used LINQ:

string exceptions = "DE DA DAS DO DOS AN NAS NO NOS EM E A AS O OS AO AOS P LDA AND";
string[] exceptionsList = exceptions.Split(' ');

string test = "THIS IS AN AMAZING WEBSITE AND LAYOUT";
string[] wordList = test.Split(' ');

string final = null;
var result = wordList.Except(exceptionsList).ToArray();
final = String.Join(" ",result);

Console.WriteLine(final);

Upvotes: 1

Shaharyar

Reputation: 12439

I just changed this line

pattern = @"\b" + word + "\b";

to this

pattern = @"\b" + word + @"\b"; //added '@' 

and I got the result

THIS IS AMAZING WEBSITE LAYOUT

It would also be better to use String.Empty instead of "", like this:

stringToClean = Regex.Replace(stringToClean, pattern, String.Empty);
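The reason the extra '@' matters: in a regular C# string literal, "\b" is the escape sequence for a single backspace character (U+0008), whereas the verbatim literal @"\b" contains a backslash followed by 'b', which the regex engine reads as a word boundary. The original pattern therefore only matched a word that was followed by a literal backspace. A small sketch to illustrate this (the class and method names are hypothetical):

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo // hypothetical class name
{
    // Length of the regular literal: one backspace character.
    public static int RegularLength() { return "\b".Length; }

    // Length of the verbatim literal: a backslash and the letter b.
    public static int VerbatimLength() { return @"\b".Length; }

    // Pattern from the question: the trailing "\b" is a backspace, not a word boundary.
    public static bool OriginalPatternMatches(string text, string word)
    {
        return Regex.IsMatch(text, @"\b" + word + "\b");
    }

    // Corrected pattern: both ends are word boundaries.
    public static bool FixedPatternMatches(string text, string word)
    {
        return Regex.IsMatch(text, @"\b" + word + @"\b");
    }
}
```

For the text THIS IS AN AMAZING WEBSITE and the word AN, the original pattern finds no match because the text contains no backspace, while the corrected pattern does match.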

Upvotes: 1

Anderung

Reputation: 31

public static string StringWordsRemove(string stringToClean, string wordsToRemove)
{
    string[] splitWords = wordsToRemove.Split(new Char[] { ' ' });
    string pattern = " (" + string.Join("|", splitWords) + ") ";
    string cleaned=Regex.Replace(stringToClean, pattern, " ");
    return cleaned;
}

Upvotes: 0
