Tropical

Reputation: 23

Loop through string and remove any occurrence of specified word

I'm trying to remove all conjunctions and pronouns from an array of strings (call it array A). The words to be removed are read from a text file and converted into an array of strings (call it array B).

What I need is to take each element of array A in turn and compare it to every word in array B; if it matches, I want to delete that word from array A.

For example:

array A = [0]I [1]want [2]to [3]go [4]home [5]and [6]sleep
array B = [0]I [1]and [2]go [3]to

Result= array A = [0]want [1]home [2]sleep

//remove any duplicates,conjunctions and Pronouns
        public IQueryable<All_Articles> removeConjunctionsProNouns(IQueryable<All_Articles> myArticles)
        {
            //get words to be removed
            string text = System.IO.File.ReadAllText("A:\\EnterpriceAssigment\\EnterpriceAssigment\\TextFiles\\conjunctions&ProNouns.txt").ToLower();
            //split word into array of strings 
            string[] wordsToBeRemoved = text.Split(',');
            //all articles
            foreach (var article in myArticles)
            {
               //split articles into words
                string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
                //loop through array of articles words
                foreach (var y in articleSplit)
                {
                    //loop through words to be removed from articleSplit
                    foreach (var x in wordsToBeRemoved)
                    {
                        //if word of articles matches word to be removed, remove word from article
                        if (y == x)
                        {
                            //get index of element in array to be removed
                            int g = Array.IndexOf(articleSplit,y);
                            //assign element to ""
                            articleSplit[g] = "";
                        }
                    }
                }
                //re-assign splitted article to string
                article.ArticleContent = articleSplit.ToString();
            }
            return myArticles;
        }

If possible, I also need array A to contain only distinct values (no duplicates).

Upvotes: 0

Views: 1466

Answers (3)

Dmitrii Bychenko

Reputation: 186678

You want to remove stop words. You can do it with the help of LINQ:

  ...
  string filePath = @"A:\EnterpriceAssigment\EnterpriceAssigment\TextFiles\conjunctions&ProNouns.txt";

  // The file in the question is comma-separated, so split on ',' and trim each entry.
  // A HashSet is much more efficient than an array for these lookups.
  HashSet<string> stopWords = new HashSet<string>(File
    .ReadAllText(filePath)
    .Split(',')
    .Select(word => word.Trim()), StringComparer.OrdinalIgnoreCase);

  foreach (var article in myArticles) {
    // read article, split into words, filter out stop words... 
    var cleared = article
      .ArticleContent
      .Split(' ')
      .Where(word => !stopWords.Contains(word));

    // ...and join words back into article
    article.ArticleContent = string.Join(" ", cleared);  
  }
  ...

Note that I've kept the Split(' ') call from your code, so this is only a toy implementation. Real text requires at least handling punctuation, which is why a better version uses regular expressions:

  foreach (var article in myArticles) {
    // read article, extract words, filter out stop words... 
    var cleared = Regex
      .Matches(article.ArticleContent, @"\w+") // <- extract words
      .OfType<Match>()
      .Select(match => match.Value)
      .Where(word => !stopWords.Contains(word));

    // ...and join words back into article
    article.ArticleContent = string.Join(" ", cleared);  
  }
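
The question also asks for distinct values, which the loops above do not handle. A minimal sketch of one way to add that, assuming a hypothetical helper named RemoveStopWords (not part of the original code) and assuming that case-insensitive duplicates should collapse to the first occurrence:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class StopWordFilter
{
    // Hypothetical helper for illustration: drops stop words and repeated words,
    // comparing case-insensitively and keeping the casing of the first occurrence.
    public static string RemoveStopWords(string content, IEnumerable<string> stopWords)
    {
        var stopSet = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        var kept = Regex
            .Matches(content, @"\w+")                     // extract words, dropping punctuation
            .Cast<Match>()
            .Select(match => match.Value)
            .Where(word => !stopSet.Contains(word))       // filter out stop words
            .Distinct(StringComparer.OrdinalIgnoreCase);  // remove duplicates

        return string.Join(" ", kept);
    }
}
```

Inside the foreach this would be called as article.ArticleContent = StopWordFilter.RemoveStopWords(article.ArticleContent, stopWords). Like the regex loop, it discards punctuation when the words are joined back together.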

Upvotes: 0

Steve

Reputation: 216293

You are looking for Enumerable.Except. It returns every element of the input sequence that is not present in the sequence passed as the argument, and each such element is returned only once.

For example:

string inputText = "I want this string to be returned without some words , but words should have only one occurrence";
string[] excludedWords = new string[] {"I","to","be", "some", "but", "should", "have", "one", ","};

var splitted = inputText.Split(' ');
var result = splitted.Except(excludedWords);
foreach(string s in result)
    Console.WriteLine(s);

// ---- Output ----
want
this
string
returned
without
words   <<-- This appears only once
only
occurrence

Applied to your code:

string text = File.ReadAllText(......).ToLower();
string[] wordsToBeRemoved = text.Split(',');
foreach (var article in myArticles)
{
    string[] articleSplit = article.ArticleContent.ToLower().Split(' ');
    var result = articleSplit.Except(wordsToBeRemoved);
    article.ArticleContent = string.Join(" ", result);
}
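
The version above lowercases the whole article before comparing. If the original casing should be kept, one option is the Except overload that takes an equality comparer. A minimal sketch, assuming a hypothetical helper named RemoveWords that is not part of the original code:

```csharp
using System;
using System.Linq;

public static class ExceptExample
{
    // Hypothetical helper for illustration: set difference on words, ignoring case.
    // Because Except is a set operation, repeated words are returned only once.
    public static string RemoveWords(string content, string[] wordsToBeRemoved)
    {
        var result = content
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Except(wordsToBeRemoved, StringComparer.OrdinalIgnoreCase);

        return string.Join(" ", result);
    }
}
```

With the comparer in place, neither the article nor the word list needs a ToLower() call.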

Upvotes: 2

Dylan Wright

Reputation: 1202

The answer may already be in your code, though it could be cleaned up a bit, as all code can. You loop through articleSplit and pull out each word, then compare it to each word in the wordsToBeRemoved array in an inner loop. When the comparison is true, you try to remove the item from the original array.

I would build a new collection for the results and then display or use it however you like, leaving out the excluded words. Loop through articleSplit: for each word x, compare it against every word y in wordsToBeRemoved, and call newArray.Add(x) only if x matches none of them.

However, this is quite a bit of work. C# arrays have no filter method as such; the closest built-in equivalents are Array.FindAll and LINQ's Where. There are many ways to achieve this.

Here is a helpful article on filtering an array in C#: https://msdn.microsoft.com/en-us/library/d9hy2xwa(v=vs.110).aspx It will save you from all that looping.
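
The loop described in this answer can be sketched roughly as follows. The class and method names (LoopFilter, KeepWords, KeepWordsFindAll) are made up for illustration, and ignoring case in the first method is an assumption rather than something the question requires:

```csharp
using System;
using System.Collections.Generic;

public static class LoopFilter
{
    // Explicit nested loops: copy a word only if it matches no excluded word.
    public static List<string> KeepWords(string[] articleSplit, string[] wordsToBeRemoved)
    {
        var kept = new List<string>();
        foreach (var x in articleSplit)
        {
            bool excluded = false;
            foreach (var y in wordsToBeRemoved)
            {
                if (string.Equals(x, y, StringComparison.OrdinalIgnoreCase))
                {
                    excluded = true;
                    break; // no need to check the remaining excluded words
                }
            }
            if (!excluded)
            {
                kept.Add(x);
            }
        }
        return kept;
    }

    // The same filter via Array.FindAll, which hides the outer loop.
    // Note: Array.IndexOf uses default equality here, so this variant is case-sensitive.
    public static string[] KeepWordsFindAll(string[] articleSplit, string[] wordsToBeRemoved)
    {
        return Array.FindAll(articleSplit, x => Array.IndexOf(wordsToBeRemoved, x) < 0);
    }
}
```

The result would then be written back with string.Join(" ", kept). Calling ToString() on a string array, as the question's code does, returns the type name "System.String[]" rather than the words.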

Upvotes: 0
