Reputation: 1034
I’m so close, but my program is still not working properly. I am trying to count how many times each word in a set appears in a text file, list those words with their individual counts, and then give the sum of all matched words.
If there are 3 instances of “lorem” and 2 instances of “ipsum”, then the total should be 5. My sample text file is simply a paragraph of “Lorem ipsum” repeated a few times.
My problem is that the code I have so far only counts the first occurrence of each word, even though each word is repeated several times throughout the text file.
I am using a paid parser called “GroupDocs.Parser” that I added through the NuGet package manager. I would prefer not to use a paid library if possible.
Is there an easier way to do this in C#?
Here’s a screenshot of my desired results.
Here is the full code that I have so far.
using GroupDocs.Parser;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace ConsoleApp5
{
    class Program
    {
        static void Main(string[] args)
        {
            using (Parser parser = new Parser(@"E:\testdata\loremIpsum.txt"))
            {
                // Extract a text into the reader
                using (TextReader reader = parser.GetText())
                {
                    // Define the search terms.
                    string[] wordsToMatch = { "Lorem", "ipsum", "amet" };
                    Dictionary<string, int> stats = new Dictionary<string, int>();
                    string text = reader.ReadToEnd();
                    char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
                    // split words
                    string[] words = text.Split(chars);
                    int minWordLength = 2;// to count words having more than 2 characters
                    // iterate over the word collection to count occurrences
                    foreach (string word in wordsToMatch)
                    {
                        string w = word.Trim().ToLower();
                        if (w.Length > minWordLength)
                        {
                            if (!stats.ContainsKey(w))
                            {
                                // add new word to collection
                                stats.Add(w, 1);
                            }
                            else
                            {
                                // update word occurrence count
                                stats[w] += 1;
                            }
                        }
                    }
                    // order the collection by word count
                    var orderedStats = stats.OrderByDescending(x => x.Value);
                    // print occurrence of each word
                    foreach (var pair in orderedStats)
                    {
                        Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);
                    }
                    // print total word count
                    Console.WriteLine("Total word count: {0}", stats.Count);
                    Console.ReadKey();
                }
            }
        }
    }
}
Any suggestions on what I'm doing wrong?
Thanks in advance.
Upvotes: 1
Views: 2281
Reputation: 4695
Splitting the entire content of the text file to get a string array of the words is not a good idea because doing so will create a new string object in memory for each word. You can imagine the cost when you deal with big files.
An alternative approach is:
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.IO;
using System.Threading.Tasks;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        var file = @"loremIpsum.txt";
        var obj = new object();

        // Search terms, each starting with a count of zero.
        var wordsToMatch = new ConcurrentDictionary<string, int>();
        wordsToMatch.TryAdd("Lorem", 0);
        wordsToMatch.TryAdd("ipsum", 0);
        wordsToMatch.TryAdd("amet", 0);

        Console.WriteLine("Press a key to continue...");
        Console.ReadKey();

        // File.ReadLines streams the file line by line instead of loading it whole.
        Parallel.ForEach(File.ReadLines(file),
            (line) =>
            {
                foreach (var word in wordsToMatch.Keys)
                    // += is a read-modify-write, so the lock keeps concurrent updates from being lost.
                    lock (obj)
                        wordsToMatch[word] += Regex.Matches(line, word,
                            RegexOptions.IgnoreCase).Count;
            });

        foreach (var kv in wordsToMatch.OrderByDescending(x => x.Value))
            Console.WriteLine($"Total occurrences of {kv.Key}: {kv.Value}");

        Console.WriteLine($"Total word count: {wordsToMatch.Values.Sum()}");
        Console.ReadKey();
    }
}
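One thing to keep in mind with the approach above: Regex.Matches with the bare word as the pattern also counts occurrences inside longer words (for example "amet" inside "parameter"). If whole-word counts are wanted, a minimal sketch is to wrap the escaped term in \b word boundaries. The class name WholeWordMatch, the helper CountWholeWord, and the sample line are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WholeWordMatch
{
    // Counts case-insensitive occurrences of 'word' that stand alone as a whole word.
    public static int CountWholeWord(string line, string word)
    {
        // Regex.Escape guards against terms that contain regex metacharacters.
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Matches(line, pattern, RegexOptions.IgnoreCase).Count;
    }

    public static void Main()
    {
        // Hypothetical sample line: "parameter" contains "amet" as a substring.
        string line = "Amet, sit amet; parameter";

        // Plain pattern: counts the substring inside "parameter" as well.
        Console.WriteLine(Regex.Matches(line, "amet", RegexOptions.IgnoreCase).Count);

        // Word-boundary pattern: counts only the two standalone occurrences.
        Console.WriteLine(CountWholeWord(line, "amet"));
    }
}
```

Whether substring matches should count depends on what the question means by a "word", so treat the boundary check as optional.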
Upvotes: 1
Reputation: 131730
You can replace this code with a LINQ query that uses case-insensitive grouping. For example:
char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };
var text = File.ReadAllText(somePath);
var query = text.Split(chars)
                .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                .Select(g => new { word = g.Key, count = g.Count() })
                .Where(stat => stat.count > 2)
                .OrderByDescending(stat => stat.count);
At that point you can iterate over the query or copy the results to an array, list or dictionary with ToArray(), ToList() or ToDictionary().
This isn't the most efficient code - for one thing, the entire file is loaded in memory. One could use File.ReadLines to load and iterate over the lines one by one. LINQ could be used to iterate over the lines as well:
var lines = File.ReadLines(somePath);
var query = lines.SelectMany(line => line.Split(chars))
                 .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
                 .Select(g => new { word = g.Key, count = g.Count() })
                 .Where(stat => stat.count > 2)
                 .OrderByDescending(stat => stat.count);
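The queries above count every word in the file and keep those that appear more than twice. If only the search terms from the question should be reported, one possible variation is to filter against a case-insensitive set before grouping. This is a sketch under assumptions: the class name LinqWordCount, the Count helper, and the inline sample lines are hypothetical, and StringSplitOptions.RemoveEmptyEntries is added so that empty strings between adjacent separators are not grouped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinqWordCount
{
    // Returns (term, count) pairs for the search terms found in the lines, highest count first.
    public static List<KeyValuePair<string, int>> Count(
        IEnumerable<string> lines, IEnumerable<string> wordsToMatch)
    {
        char[] chars = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };

        // Case-insensitive lookup of the search terms.
        var terms = new HashSet<string>(wordsToMatch, StringComparer.OrdinalIgnoreCase);

        return lines
            .SelectMany(line => line.Split(chars, StringSplitOptions.RemoveEmptyEntries))
            .Where(w => terms.Contains(w))
            .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(stat => stat.Value)
            .ToList();
    }

    public static void Main()
    {
        // Inline sample lines (an assumption); File.ReadLines(somePath) could supply them instead.
        string[] lines = { "Lorem ipsum dolor sit amet.", "Lorem ipsum, lorem." };
        var stats = Count(lines, new[] { "Lorem", "ipsum", "amet" });

        foreach (var stat in stats)
            Console.WriteLine($"Total occurrences of {stat.Key}: {stat.Value}");

        Console.WriteLine($"Total word count: {stats.Sum(s => s.Value)}");
    }
}
```

Note that a search term with no matches does not appear in the result at all, since GroupBy only produces groups for words that actually occur.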
Upvotes: 0
Reputation: 156728
stats is a dictionary, so stats.Count will only tell you how many distinct words there are. You need to add up all the values in it, with something like stats.Values.Sum().
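For context, the question's loop iterates over wordsToMatch rather than over the words split from the text, which is why each term ends up with a count of 1. A minimal sketch of a corrected loop, combined with the Sum() fix, might look like the following. It leaves out GroupDocs.Parser and uses an inline sample string in place of the file; the class name WordStats and the Count helper are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordStats
{
    // Counts how often each search term occurs in the text, ignoring case.
    public static Dictionary<string, int> Count(string text, IEnumerable<string> wordsToMatch)
    {
        char[] separators = { ' ', '.', ',', ';', ':', '?', '\n', '\r' };

        // Seed every search term with zero so unmatched terms are still reported.
        var stats = wordsToMatch.ToDictionary(w => w, w => 0, StringComparer.OrdinalIgnoreCase);

        // Loop over the words of the text, not over the search terms.
        foreach (string word in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            if (stats.ContainsKey(word))
            {
                stats[word] += 1;
            }
        }
        return stats;
    }

    public static void Main()
    {
        // Inline sample text (an assumption); File.ReadAllText(path) could supply it instead.
        string text = "Lorem ipsum dolor sit amet. Lorem ipsum, lorem.";
        var stats = Count(text, new[] { "Lorem", "ipsum", "amet" });

        foreach (var pair in stats.OrderByDescending(x => x.Value))
        {
            Console.WriteLine("Total occurrences of {0}: {1}", pair.Key, pair.Value);
        }

        // Sum of all counts, rather than the number of distinct keys.
        Console.WriteLine("Total word count: {0}", stats.Values.Sum());
    }
}
```

Because the dictionary is built with StringComparer.OrdinalIgnoreCase, the words do not need to be lower-cased before the lookup.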
Upvotes: 0