Reputation: 368
I am trying to find the most frequent words in a string, e.g.
Hello World This is a great world, This World is simply great
From the above string I am trying to calculate results something like the following:
but ignoring any words shorter than 3 characters, e.g. "is", which occurred twice.
I looked into Dictionary<key, value> pairs and LINQ's GroupBy extension. I know the solution lies somewhere in between, but I can't get my head around the algorithm and how to get this done.
Upvotes: 11
Views: 19428
Reputation: 11
You should be able to do this using LINQ:
string[] splitString = actualString.Split(' ');
var arrayCount = splitString.GroupBy(a => a);
foreach (var r in arrayCount)
{
Console.WriteLine("The word " + r.Key + " appeared " + r.Count() + " times in the string.");
}
This can be solved in many different ways.
Upvotes: 0
Reputation: 4290
const string input = "Hello World This is a great world, This World is simply great";
var words = input
.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
.Where(w => w.Length >= 3)
.GroupBy(w => w)
.OrderByDescending(g => g.Count());
foreach (var word in words)
Console.WriteLine("{0}x {1}", word.Count(), word.Key);
// 2x World
// 2x This
// 2x great
// 1x Hello
// 1x world,
// 1x simply
Not perfect, because it doesn't trim the comma, but it shows you how to do the grouping and filtering at least.
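One way to handle the comma, and the "World"/"world" split, is to trim punctuation from each token and group case-insensitively. This is only a sketch: the class name WordCountDemo, the helper CountWords and the list of punctuation characters are assumptions made for illustration, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordCountDemo
{
    // Hypothetical helper: counts words of at least minLength characters,
    // ignoring case and any punctuation at either end of a token.
    public static List<KeyValuePair<string, int>> CountWords(string input, int minLength = 3)
    {
        return input
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => w.Trim(',', '.', '!', '?', ';', ':'))     // assumed punctuation set
            .Where(w => w.Length >= minLength)
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)      // "World" and "world" share a group
            .OrderByDescending(g => g.Count())
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .ToList();
    }

    public static void Main()
    {
        const string input = "Hello World This is a great world, This World is simply great";
        foreach (var pair in CountWords(input))
            Console.WriteLine("{0}x {1}", pair.Value, pair.Key);
        // 3x World
        // 2x This
        // 2x great
        // 1x Hello
        // 1x simply
    }
}
```

The group key is the first spelling encountered, so the output shows "World" rather than "world". Trim only removes characters at the ends of a token; punctuation inside a word (such as an apostrophe) is left alone.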
Upvotes: 3
Reputation: 8783
I wrote a string processor class. You can use it.
Example:
metaKeywords = bodyText.Process(blackListWords: prepositions).OrderByDescending().TakeTop().GetWords().AsString();
Class:
public static class StringProcessor
{
private static List<String> PrepositionList;
public static string ToNormalString(this string strText)
{
if (String.IsNullOrEmpty(strText)) return String.Empty;
char chNormalKaf = (char)1603;
char chNormalYah = (char)1610;
char chNonNormalKaf = (char)1705;
char chNonNormalYah = (char)1740;
string result = strText.Replace(chNonNormalKaf, chNormalKaf);
result = result.Replace(chNonNormalYah, chNormalYah);
return result;
}
public static List<KeyValuePair<String, Int32>> Process(this String bodyText,
List<String> blackListWords = null,
int minimumWordLength = 3,
char splitor = ' ',
bool perWordIsLowerCase = true)
{
string[] btArray = bodyText.ToNormalString().Split(splitor);
long numberOfWords = btArray.LongLength;
Dictionary<String, Int32> wordsDic = new Dictionary<String, Int32>(1);
foreach (string word in btArray)
{
if (word != null)
{
string lowerWord = word;
if (perWordIsLowerCase)
lowerWord = word.ToLower();
var normalWord = lowerWord.Replace(".", "").Replace("(", "").Replace(")", "")
.Replace("?", "").Replace("!", "").Replace(",", "")
.Replace("<br>", "").Replace(":", "").Replace(";", "")
.Replace("،", "").Replace("-", "").Replace("\n", "").Trim();
// ">=" so that words of exactly minimumWordLength characters are kept
if (normalWord.Length >= minimumWordLength && !normalWord.IsMemberOfBlackListWords(blackListWords))
{
if (wordsDic.ContainsKey(normalWord))
{
var cnt = wordsDic[normalWord];
wordsDic[normalWord] = ++cnt;
}
else
{
wordsDic.Add(normalWord, 1);
}
}
}
}
List<KeyValuePair<String, Int32>> keywords = wordsDic.ToList();
return keywords;
}
public static List<KeyValuePair<String, Int32>> OrderByDescending(this List<KeyValuePair<String, Int32>> list, bool isBasedOnFrequency = true)
{
List<KeyValuePair<String, Int32>> result = null;
if (isBasedOnFrequency)
result = list.OrderByDescending(q => q.Value).ToList();
else
result = list.OrderByDescending(q => q.Key).ToList();
return result;
}
public static List<KeyValuePair<String, Int32>> TakeTop(this List<KeyValuePair<String, Int32>> list, Int32 n = 10)
{
List<KeyValuePair<String, Int32>> result = list.Take(n).ToList();
return result;
}
public static List<String> GetWords(this List<KeyValuePair<String, Int32>> list)
{
List<String> result = new List<String>();
foreach (var item in list)
{
result.Add(item.Key);
}
return result;
}
public static List<Int32> GetFrequency(this List<KeyValuePair<String, Int32>> list)
{
List<Int32> result = new List<Int32>();
foreach (var item in list)
{
result.Add(item.Value);
}
return result;
}
public static String AsString<T>(this List<T> list, string seprator = ", ")
{
// string.Join avoids the trailing separator and the repeated string concatenation
return string.Join(seprator, list);
}
private static bool IsMemberOfBlackListWords(this String word, List<String> blackListWords)
{
bool result = false;
if (blackListWords == null) return false;
foreach (var w in blackListWords)
{
if (w.ToNormalString().Equals(word))
{
result = true;
break;
}
}
return result;
}
}
Upvotes: 3
Reputation: 2758
I'd avoid LINQ, Regex and the like here, since it sounds like you want to understand the algorithm rather than have a library function do the work for you.
Not that those are not valid solutions. They are. Definitely.
Try something like this:
Dictionary<string, int> dictionary = new Dictionary<string, int>();
string sInput = "Hello World, This is a great World. I love this great World";
sInput = sInput.Replace(",", ""); //Just cleaning up a bit
sInput = sInput.Replace(".", ""); //Just cleaning up a bit
string[] arr = sInput.Split(' '); //Create an array of words
foreach (string word in arr) //let's loop over the words
{
if (word.Length >= 3) //if it meets our criteria of at least 3 letters
{
if (dictionary.ContainsKey(word)) //if it's in the dictionary
dictionary[word] = dictionary[word] + 1; //Increment the count
else
dictionary[word] = 1; //put it in the dictionary with a count 1
}
}
foreach (KeyValuePair<string, int> pair in dictionary) //loop through the dictionary
Response.Write(string.Format("Word: {0}, Count: {1}<br />", pair.Key, pair.Value));
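The loop above counts the words but does not rank them. In the same no-LINQ spirit, the dictionary entries could be copied into a list and sorted by count. This is a minimal sketch; the class TopWordsDemo and the helper TopN are hypothetical names, and the dictionary in Main is filled by hand with the counts the loop above would produce for its sample sentence.

```csharp
using System;
using System.Collections.Generic;

public static class TopWordsDemo
{
    // Hypothetical helper: returns up to n entries with the highest counts.
    public static List<KeyValuePair<string, int>> TopN(Dictionary<string, int> counts, int n)
    {
        // Copy the entries so the dictionary itself is left untouched.
        var entries = new List<KeyValuePair<string, int>>(counts);

        // Sort by count, highest first. List.Sort is not a stable sort,
        // so entries with equal counts may come out in any order.
        entries.Sort((a, b) => b.Value.CompareTo(a.Value));

        if (entries.Count > n)
            entries.RemoveRange(n, entries.Count - n);
        return entries;
    }

    public static void Main()
    {
        var dictionary = new Dictionary<string, int>
        {
            { "Hello", 1 }, { "World", 3 }, { "This", 1 },
            { "great", 2 }, { "love", 1 }, { "this", 1 }
        };

        foreach (var pair in TopN(dictionary, 2))
            Console.WriteLine("{0}: {1}", pair.Key, pair.Value);
        // World: 3
        // great: 2
    }
}
```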
Upvotes: 6
Reputation: 35407
string words = "Hello World This is a great world, This World is simply great".ToLower();
var results = words.Split(' ').Where(x => x.Length >= 3) // keep words of 3 or more characters
.GroupBy(x => x)
.Select(x => new { Count = x.Count(), Word = x.Key })
.OrderByDescending(x => x.Count);
foreach (var item in results)
Console.WriteLine("{0} occurred {1} times", item.Word, item.Count);
Console.ReadLine();
To get the word with the most occurrences:
var topWord = results.First().Word;
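Note that First() throws an InvalidOperationException when the sequence is empty, for example when no word in the input passes the length filter. A hedged variant using FirstOrDefault() is sketched below; the class TopWordDemo and the helper MostFrequentWord are hypothetical names introduced for this illustration.

```csharp
using System;
using System.Linq;

public static class TopWordDemo
{
    // Hypothetical helper: returns the most frequent word of at least
    // minLength characters, or null when no word qualifies.
    public static string MostFrequentWord(string text, int minLength = 3)
    {
        return text.ToLower()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(x => x.Length >= minLength)
            .GroupBy(x => x)
            .OrderByDescending(g => g.Count())
            .Select(g => g.Key)
            .FirstOrDefault();   // null instead of an exception on empty input
    }

    public static void Main()
    {
        Console.WriteLine(MostFrequentWord("Hello World This is a great world, This World is simply great"));
        // world
        Console.WriteLine(MostFrequentWord("a is") ?? "(no qualifying word)");
        // (no qualifying word)
    }
}
```

Because the split is on spaces only, "world," with its trailing comma is still counted separately from "world", as in the code above.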
Upvotes: 2
Reputation: 10221
Using LINQ and Regex
var topWords = Regex.Split("Hello World This is a great world, This World is simply great".ToLower(), @"\W+")
.Where(s => s.Length >= 3) // keep words of 3 or more characters
.GroupBy(s => s)
.OrderByDescending(g => g.Count());
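For completeness, the same pipeline could be wrapped in a small program that returns only the top n words. This is a sketch under assumptions: the class RegexWordCountDemo, the helper TopWords and its parameters are illustrative names, not part of the answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordCountDemo
{
    // Hypothetical helper: returns the n most frequent words (lower-cased)
    // that have at least minLength characters.
    public static string[] TopWords(string text, int n, int minLength = 3)
    {
        return Regex.Split(text.ToLower(), @"\W+")   // split on runs of non-word characters
            .Where(s => s.Length >= minLength)        // also drops empty strings produced by the split
            .GroupBy(s => s)
            .OrderByDescending(g => g.Count())
            .Take(n)
            .Select(g => g.Key)
            .ToArray();
    }

    public static void Main()
    {
        const string input = "Hello World This is a great world, This World is simply great";
        Console.WriteLine(string.Join(", ", TopWords(input, 3)));
        // world, this, great
    }
}
```

Splitting on \W+ removes the comma after "world", so all three occurrences land in the same group without any extra trimming.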
Upvotes: 22