Thr3e

Reputation: 368

Find the highest occurring words in a string C#

I am trying to find the top occurrences of words in a string.

e.g.

Hello World This is a great world, This World is simply great

From the above string, I am trying to calculate results something like the following:

but ignoring any words with a length of less than 3 characters, e.g. "is", which occurred twice.

I looked into Dictionary<key, value> pairs and LINQ's GroupBy extension. I know the solution lies somewhere in between, but I just can't get my head around the algorithm and how to get this done.

Upvotes: 11

Views: 19428

Answers (6)

Paresh Bhatt

Reputation: 11

You should be able to do this using LINQ:

// Split on spaces, then group identical words together
string[] splitString = actualString.Split(' ');
var arrayCount = splitString.GroupBy(a => a);
foreach (var r in arrayCount)
{
    // r.Key is the word, r.Count() is how many times it appeared
    Console.WriteLine("This " + r.Key + " appeared " + r.Count() + " times in a string.");
}

This can be solved in many different ways. Link for reference.

Upvotes: 0

Tatham Oddie

Reputation: 4290

const string input = "Hello World This is a great world, This World is simply great";
var words = input
    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
    .Where(w => w.Length >= 3)
    .GroupBy(w => w)
    .OrderByDescending(g => g.Count());

foreach (var word in words)
    Console.WriteLine("{0}x {1}", word.Count(), word.Key);

// 2x World
// 2x This
// 2x great
// 1x Hello
// 1x world,
// 1x simply

Not perfect, because it doesn't trim the comma, but it shows you how to do the grouping and filtering at least.
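One way to deal with the comma, as a rough sketch only: trim common punctuation from each word before filtering, and group case-insensitively so that "World" and "world" count together. The WordCounter class, the CountWords method and the minLength parameter are names made up for this illustration, and the list of trimmed characters is an assumption about what counts as punctuation in your input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCounter
{
    // Hypothetical helper (not part of the original answer).
    // Returns (word, count) pairs ordered by descending count.
    public static List<KeyValuePair<string, int>> CountWords(string input, int minLength = 3)
    {
        return input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            // Assumption: only these characters need stripping from word edges
            .Select(w => w.Trim(',', '.', '!', '?', ';', ':'))
            .Where(w => w.Length >= minLength)
            // Case-insensitive grouping; the key keeps the first spelling seen
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            .OrderByDescending(g => g.Count())
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }
}
```

Called with the input string from the question, this should put "World" first with a count of 3, since "world," is now trimmed and merged into the same group.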

Upvotes: 3

x19

Reputation: 8783

I wrote a string processor class. You can use it.

Example:

metaKeywords = bodyText.Process(blackListWords: prepositions).OrderByDescending().TakeTop().GetWords().AsString();

Class:

public static class StringProcessor
{
    private static List<String> PrepositionList;

    public static string ToNormalString(this string strText)
    {
        if (String.IsNullOrEmpty(strText)) return String.Empty;
        char chNormalKaf = (char)1603;
        char chNormalYah = (char)1610;
        char chNonNormalKaf = (char)1705;
        char chNonNormalYah = (char)1740;
        string result = strText.Replace(chNonNormalKaf, chNormalKaf);
        result = result.Replace(chNonNormalYah, chNormalYah);
        return result;
    }

    public static List<KeyValuePair<String, Int32>> Process(this String bodyText,
        List<String> blackListWords = null,
        int minimumWordLength = 3,
        char splitor = ' ',
        bool perWordIsLowerCase = true)
    {
        string[] btArray = bodyText.ToNormalString().Split(splitor);
        long numberOfWords = btArray.LongLength;
        Dictionary<String, Int32> wordsDic = new Dictionary<String, Int32>(1);
        foreach (string word in btArray)
        {
            if (word != null)
            {
                string lowerWord = word;
                if (perWordIsLowerCase)
                    lowerWord = word.ToLower();
                var normalWord = lowerWord.Replace(".", "").Replace("(", "").Replace(")", "")
                    .Replace("?", "").Replace("!", "").Replace(",", "")
                    .Replace("<br>", "").Replace(":", "").Replace(";", "")
                    .Replace("،", "").Replace("-", "").Replace("\n", "").Trim();
                // Keep words that are at least minimumWordLength long and not blacklisted
                if (normalWord.Length >= minimumWordLength && !normalWord.IsMemberOfBlackListWords(blackListWords))
                {
                    if (wordsDic.ContainsKey(normalWord))
                    {
                        var cnt = wordsDic[normalWord];
                        wordsDic[normalWord] = ++cnt;
                    }
                    else
                    {
                        wordsDic.Add(normalWord, 1);
                    }
                }
            }
        }
        List<KeyValuePair<String, Int32>> keywords = wordsDic.ToList();
        return keywords;
    }

    public static List<KeyValuePair<String, Int32>> OrderByDescending(this List<KeyValuePair<String, Int32>> list, bool isBasedOnFrequency = true)
    {
        List<KeyValuePair<String, Int32>> result = null;
        if (isBasedOnFrequency)
            result = list.OrderByDescending(q => q.Value).ToList();
        else
            result = list.OrderByDescending(q => q.Key).ToList();
        return result;
    }

    public static List<KeyValuePair<String, Int32>> TakeTop(this List<KeyValuePair<String, Int32>> list, Int32 n = 10)
    {
        List<KeyValuePair<String, Int32>> result = list.Take(n).ToList();
        return result;
    }

    public static List<String> GetWords(this List<KeyValuePair<String, Int32>> list)
    {
        List<String> result = new List<String>();
        foreach (var item in list)
        {
            result.Add(item.Key);
        }
        return result;
    }

    public static List<Int32> GetFrequency(this List<KeyValuePair<String, Int32>> list)
    {
        List<Int32> result = new List<Int32>();
        foreach (var item in list)
        {
            result.Add(item.Value);
        }
        return result;
    }

    public static String AsString<T>(this List<T> list, string seprator = ", ")
    {
        // string.Join avoids the trailing separator left by appending in a loop
        return string.Join(seprator, list);
    }

    private static bool IsMemberOfBlackListWords(this String word, List<String> blackListWords)
    {
        bool result = false;
        if (blackListWords == null) return false;
        foreach (var w in blackListWords)
        {
            if (w.ToNormalString().Equals(word))
            {
                result = true;
                break;
            }
        }
        return result;
    }
}

Upvotes: 3

Jordan

Reputation: 2758

I'd avoid LINQ, Regex and the like, since it sounds like you are trying to find and understand an algorithm, not use some function that does it for you.

Not that those are not valid solutions. They are. Definitely.

Try something like this:

Dictionary<string, int> dictionary = new Dictionary<string, int>();

string sInput = "Hello World, This is a great World. I love this great World";
sInput = sInput.Replace(",", ""); //Just cleaning up a bit
sInput = sInput.Replace(".", ""); //Just cleaning up a bit
string[] arr = sInput.Split(' '); //Create an array of words

foreach (string word in arr) //let's loop over the words
{
    if (word.Length >= 3) //if it meets our criteria of at least 3 letters
    {
        if (dictionary.ContainsKey(word)) //if it's in the dictionary
            dictionary[word] = dictionary[word] + 1; //Increment the count
        else
            dictionary[word] = 1; //put it in the dictionary with a count 1
    }
}

foreach (KeyValuePair<string, int> pair in dictionary) //loop through the dictionary
    Response.Write(string.Format("Word: {0}, Count: {1}<br />", pair.Key, pair.Value));
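The loop above prints the counts in no particular order. To get the top occurrences without LINQ, one option is to copy the dictionary entries into a list and sort it by count. This is only a sketch: the TopWords class and the Top method are hypothetical names, not something from the framework.

```csharp
using System;
using System.Collections.Generic;

public static class TopWords
{
    // Hypothetical helper: returns up to n entries with the highest counts.
    public static List<KeyValuePair<string, int>> Top(Dictionary<string, int> counts, int n)
    {
        // Copy the entries so the dictionary itself is left untouched
        var entries = new List<KeyValuePair<string, int>>(counts);

        // Sort descending by count; List.Sort is not stable,
        // so words with equal counts may come out in any order
        entries.Sort((a, b) => b.Value.CompareTo(a.Value));

        // Drop everything after the first n entries
        if (entries.Count > n)
            entries.RemoveRange(n, entries.Count - n);

        return entries;
    }
}
```

You would then loop over the result of TopWords.Top(dictionary, 3) instead of the dictionary itself.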

Upvotes: 6

Alex

Reputation: 35407

string words = "Hello World This is a great world, This World is simply great".ToLower();

var results = words.Split(' ').Where(x => x.Length >= 3) // ignore words shorter than 3 characters
                              .GroupBy(x => x)
                              .Select(x => new { Count = x.Count(), Word = x.Key })
                              .OrderByDescending(x => x.Count);

foreach (var item in results)
    Console.WriteLine(String.Format("{0} occurred {1} times", item.Word, item.Count));

Console.ReadLine();

To get the word with the most occurrences:

var topWord = results.First().Word;

Upvotes: 2

Ilia G

Reputation: 10221

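Using LINQ and Regex:

// Split on runs of non-word characters, so punctuation is dropped
var groups = Regex.Split("Hello World This is a great world, This World is simply great".ToLower(), @"\W+")
    .Where(s => s.Length >= 3) // ignore words shorter than 3 characters
    .GroupBy(s => s)
    .OrderByDescending(g => g.Count());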

Upvotes: 22
