Reputation: 1034

Count how many times certain words appear in a text with C#

I’m close, but my program still isn’t working properly. I am trying to count how many times each word in a given set appears in a text file, list those words with their individual counts, and then print the sum of all matches.

If there are 3 instances of “lorem” and 2 instances of “ipsum”, then the total should be 5. My sample text file is simply a paragraph of “Lorem ipsum” repeated a few times.

My problem is that the code I have so far counts only the first occurrence of each word, even though each word is repeated several times throughout the text file.

I am using a paid parser called “GroupDocs.Parser”, which I added through the NuGet package manager. I would prefer not to depend on a paid library if possible.

Is there an easier way to do this in C#?

Here’s a screenshot of my desired results.

Here is the full code that I have so far.

using GroupDocs.Parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {

            using (Parser parser = new Parser(@"E:\testdata\loremIpsum.txt"))
            {

                // Extract a text into the reader
                using (TextReader reader = parser.GetText())
                {
                    // Define the search terms. 
                    string[] wordsToMatch = { "Lorem", "ipsum", "amet" };

                    Dictionary<string, int> stats = new Dictionary<string, int>();
                    string text = reader.ReadToEnd();
                    char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
                    // split words
                    string[] words = text.Split(chars);
                    int minWordLength = 2;// to count words having more than 2 characters

                    // iterate over the word collection to count occurrences
                    foreach (string word in wordsToMatch)
                    {
                        string w = word.Trim().ToLower();
                        if (w.Length > minWordLength)
                        {
                            if (!stats.ContainsKey(w))
                            {
                                // add new word to collection
                                stats.Add(w, 1);
                            }
                            else
                            {
                                // update word occurrence count
                                stats[w] += 1;
                            }
                        }
                    }

                    // order the collection by word count
                    var orderedStats = stats.OrderByDescending(x => x.Value);


                    // print occurrence of each word
                    foreach (var pair in orderedStats)
                    {
                        Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);

                    }
                    // print total word count
                    Console.WriteLine("Total word count: {0}", stats.Count);
                    Console.ReadKey();
                }
            }
        }
    }
}

Any suggestions on what I'm doing wrong?

Thanks in advance.

Upvotes: 1

Answers (3)

dr.null

Reputation: 4695

Splitting the entire content of the text file to get a string array of the words is not a good idea because doing so will create a new string object in memory for each word. You can imagine the cost when you deal with big files.

An alternative approach is:

Use the Parallel.ForEach method to process the lines of the text file in parallel.
Use the thread-safe ConcurrentDictionary<TKey,TValue> collection so that the parallel threads can share it.
Increment the value of each word (key) by the number of matches that Regex.Matches returns for it.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

class Program
{
    static void Main(string[] args)
    {
        var file = @"loremIpsum.txt";
        var obj = new object();
        var wordsToMatch = new ConcurrentDictionary<string, int>();

        // Seed each search term with a count of zero.
        wordsToMatch.TryAdd("Lorem", 0);
        wordsToMatch.TryAdd("ipsum", 0);
        wordsToMatch.TryAdd("amet", 0);

        Console.WriteLine("Press a key to continue...");
        Console.ReadKey();

        // File.ReadLines streams the file lazily; the lines are processed in parallel.
        // The indexer += below is a read followed by a write, so it is guarded by a lock.
        Parallel.ForEach(File.ReadLines(file),
            (line) =>
            {
                foreach (var word in wordsToMatch.Keys)
                    lock (obj)
                        wordsToMatch[word] += Regex.Matches(line, word,
                            RegexOptions.IgnoreCase).Count;
            });

        // Print each term with its count, highest first.
        foreach (var kv in wordsToMatch.OrderByDescending(x => x.Value))
            Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");

        // The total is the sum of the per-word counts.
        Console.WriteLine($"Total word count: {wordsToMatch.Values.Sum()}");
        Console.ReadKey();
    }
}
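
One possible refinement of the code above, offered as a sketch rather than as this answer's own method: ConcurrentDictionary.AddOrUpdate applies the increment as a single atomic update per key, so the explicit lock is no longer needed, and wrapping each term in \b anchors (after Regex.Escape) keeps "amet" from matching inside a longer word such as "parameter". The names ParallelWordCounter and Count and the in-memory sample lines are assumptions made for illustration; File.ReadLines(file) would supply the lines in practice.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class ParallelWordCounter   // hypothetical name, not from the original answer
{
    // Counts whole-word, case-insensitive matches of each target across all lines.
    public static ConcurrentDictionary<string, int> Count(
        IEnumerable<string> lines, IEnumerable<string> targets)
    {
        var counts = new ConcurrentDictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        // One pattern per target; Regex instances are safe to share across threads.
        var patterns = targets.ToDictionary(
            t => t,
            t => new Regex(@"\b" + Regex.Escape(t) + @"\b", RegexOptions.IgnoreCase));

        // Seed every target with zero so words that never match are still reported.
        foreach (var t in patterns.Keys)
            counts.TryAdd(t, 0);

        Parallel.ForEach(lines, line =>
        {
            foreach (var kv in patterns)
            {
                int found = kv.Value.Matches(line).Count;
                if (found > 0)
                    counts.AddOrUpdate(kv.Key, found, (_, current) => current + found);
            }
        });

        return counts;
    }

    static void Main()
    {
        // Sample data stands in for File.ReadLines(path).
        var lines = new[] { "Lorem ipsum dolor sit amet,", "lorem IPSUM parameter." };
        var counts = Count(lines, new[] { "Lorem", "ipsum", "amet" });

        foreach (var kv in counts.OrderByDescending(x => x.Value))
            Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");

        Console.WriteLine($"Total word count: {counts.Values.Sum()}"); // 2 + 2 + 1 = 5
    }
}
```

Whether the parallel version is actually faster depends on the file size; for a short paragraph a plain foreach over the lines would likely do just as well.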

Upvotes: 1

Panagiotis Kanavos

Reputation: 131730

You can replace this code with a LINQ query that uses case-insensitive grouping. For example:

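char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
var text = File.ReadAllText(somePath);
// RemoveEmptyEntries drops the empty strings produced by adjacent separators.
var query = text.Split(chars, StringSplitOptions.RemoveEmptyEntries)
                .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                .Select(g => new { word = g.Key, count = g.Count() })
                .Where(stat => stat.count > 2)   // keep words that occur more than twice
                .OrderByDescending(stat => stat.count);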

At that point you can iterate over the query or copy the results to an array or dictionary with ToArray(), ToList() or ToDictionary().

This isn't the most efficient code: for one thing, the entire file is loaded into memory. One could use File.ReadLines instead, which reads the lines lazily, one at a time, and LINQ can iterate over those lines as well:

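var lines = File.ReadLines(somePath);
// chars is the separator array defined in the first snippet.
var query = lines.SelectMany(line => line.Split(chars, StringSplitOptions.RemoveEmptyEntries))
                 .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                 .Select(g => new { word = g.Key, count = g.Count() })
                 .Where(stat => stat.count > 2)   // keep words that occur more than twice
                 .OrderByDescending(stat => stat.count);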
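
The queries above count every word in the file. To restrict them to the three search terms from the question and get the grand total, one option is to filter against a case-insensitive HashSet before grouping and materialize the result with ToDictionary. This is only a sketch: the names LinqWordCounter, Count and Separators are made up for illustration, and the sample lines stand in for File.ReadLines(somePath).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqWordCounter   // hypothetical name, for illustration only
{
    static readonly char[] Separators = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };

    // Returns a count for each target word that occurs at least once in the lines.
    public static Dictionary<string, int> Count(
        IEnumerable<string> lines, IEnumerable<string> targets)
    {
        var wanted = new HashSet<string>(targets, StringComparer.OrdinalIgnoreCase);

        return lines
            .SelectMany(line => line.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Where(w => wanted.Contains(w))                      // keep only the search terms
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)   // "Lorem" and "lorem" share a group
            .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);
    }

    static void Main()
    {
        // Sample data stands in for File.ReadLines(somePath).
        var lines = new[] { "Lorem ipsum dolor sit amet.", "lorem ipsum, Lorem" };
        var stats = Count(lines, new[] { "Lorem", "ipsum", "amet" });

        foreach (var pair in stats.OrderByDescending(x => x.Value))
            Console.WriteLine($"Total occurrences of {pair.Key}: {pair.Value}");

        Console.WriteLine($"Total word count: {stats.Values.Sum()}"); // 3 + 2 + 1 = 6
    }
}
```

Note that a search term that never appears in the text is simply absent from the dictionary here, rather than being reported with a count of zero.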

Upvotes: 0

StriplingWarrior

Reputation: 156728

stats is a dictionary, so stats.Count will only tell you how many distinct words there are. You need to add up all the values in it. Something like stats.Values.Sum().
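
A minimal sketch of the question's loop with that change applied. It also goes one step beyond this answer, as an assumption about what the asker intended: the foreach walks the words of the text rather than wordsToMatch, since iterating over the search terms themselves is why each one ends up with a count of 1. The GroupDocs parser is left out, and the class name FixedLoop and the sample text are hypothetical; File.ReadAllText could supply the text for a plain .txt file.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class FixedLoop   // hypothetical name, for illustration only
{
    public static Dictionary<string, int> Count(string text, string[] wordsToMatch)
    {
        char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };

        // A case-insensitive dictionary removes the need for ToLower().
        var stats = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (string target in wordsToMatch)
            stats[target] = 0;

        // Walk the words of the text, not the search terms.
        foreach (string word in text.Split(chars, StringSplitOptions.RemoveEmptyEntries))
        {
            if (stats.ContainsKey(word))
                stats[word] += 1;
        }

        return stats;
    }

    static void Main()
    {
        string text = "Lorem ipsum dolor sit amet. Lorem ipsum.";
        var stats = Count(text, new[] { "Lorem", "ipsum", "amet" });

        foreach (var pair in stats.OrderByDescending(x => x.Value))
            Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);

        // Sum the values; stats.Count would only report the 3 distinct keys.
        Console.WriteLine("Total word count: {0}", stats.Values.Sum()); // 2 + 2 + 1 = 5
    }
}
```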

Upvotes: 0
