Reputation: 46
I have a C# console app that is supposed to read through a .txt file and count the distinct words. It works, but it reads through the file once for every distinct word, so with a 100 MB file it runs for days. What I would like is a way to read through the file once and count all the distinct words. Here is some of the app so far:
using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.Diagnostics;
using System.Data;
using System.IO.MemoryMappedFiles;
namespace CompressionApp
{
    class Program
    {
        static void Main(string[] args)
        {
            //read all text
            string FilePath = (@"D:\Test\testing.txt");
            string FullText;
            using (StreamReader streamReader = new StreamReader(FilePath))
            {
                FullText = streamReader.ReadToEnd();
            }
            FileInfo Info = new FileInfo(FilePath);
            int FileSize = Convert.ToInt32(Info.Length);
            //some code
            string[] Words = FullText.Split(' ');
            var DistinctWords = new List<string>(Words.Distinct());
            //some code
            int P = 0;
            int ID = 0;
            int Length = 0;
            int ByteWorth;
            double Perc;
            double PPerc = 0;
            bool display = false;
            using (var mappedFile1 = MemoryMappedFile.CreateFromFile(FilePath))
            {
                using (Stream mmStream = mappedFile1.CreateViewStream())
                {
                    using (StreamReader sr = new StreamReader(mmStream, ASCIIEncoding.ASCII))
                    {
                        Parallel.ForEach(DistinctWords, new ParallelOptions { MaxDegreeOfParallelism = 1 }, Word =>
                        {
                            DataRow dr = dt.NewRow();
                            string SearchTerm = Word;
                            var MatchQuery = from word in Words
                                             where word == SearchTerm
                                             select word;
                            int WordCount = MatchQuery.Count();
                            Length = SearchTerm.Length;
                            if (Length > 1)
                            {
                                if (WordCount > 1)
                                {
                                    ID = ID + 1;
                                    ByteWorth = (Length * 8) * WordCount;
                                    dr["Word"] = SearchTerm;
                                    dr["Count"] = WordCount;
                                    dr["ID"] = ID;
                                    dr["Length"] = Length;
                                    dr["ByteWorth"] = ByteWorth;
                                    dt.Rows.Add(dr);
                                }
                            }
                            //some code below
This is the complete app so far. It's not very tidy, I know, but I am new to coding.
Any tips, hints, or suggestions are welcome.
Upvotes: 1
Views: 1135
Reputation: 134125
So as I understand it, you're getting the distinct words and then for each word you're going through the entire file to count occurrences of that word. My bet is that finding the distinct words takes very little time, but the loop that counts occurrences is taking approximately forever.
You can get the distinct words and their counts with LINQ. Replace this line of code:
var DistinctWords = new List<string>(Words.Distinct());
with
var DistinctWithCount = from word in Words
                        group word by word
                        into g
                        select new {Word = g.Key, Count = g.Count()};
You can then enumerate the words with counts like this:
foreach (var g in DistinctWithCount)
{
    Console.WriteLine("{0},{1}", g.Word, g.Count);
}
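For reference, here is a minimal self-contained sketch of the same grouping idea in method syntax. The class name `WordCountDemo`, the helper `CountWords`, and the sample sentence are made up for illustration and are not part of the original app.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordCountDemo
{
    // Hypothetical helper: groups equal words together and counts each group.
    // The input is enumerated once, instead of once per distinct word.
    public static Dictionary<string, int> CountWords(IEnumerable<string> words)
    {
        return words
            .GroupBy(w => w)
            .ToDictionary(g => g.Key, g => g.Count());
    }

    static void Main()
    {
        // Sample input (an assumption); the real app would pass its Words array.
        string[] words = "the cat saw the dog and the bird".Split(' ');

        foreach (var pair in CountWords(words))
        {
            Console.WriteLine("{0},{1}", pair.Key, pair.Value);
        }
    }
}
```

If the counts are only needed for words longer than one character that occur more than once, as in the question, a `Where` on the groups before `ToDictionary` should cover it.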
Upvotes: 2
Reputation: 1462
I can't write the whole logic for you, but here are some pointers. I am using a dictionary instead of a DataTable; you can build the table from the dictionary later. If you want an ID as well, use a complex value type instead of 'int'. The int value currently holds the number of occurrences of that word.
var CheckedWords = new Dictionary<string, int>();
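Below is how the body of my foreach loop looks. Note that the loop has to run over Words (every word in the file), not DistinctWords; otherwise each word is seen only once and every count ends up as 1: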
/*DataRow dr = dt.NewRow();
string SearchTerm = Word;
var MatchQuery = from word in Words
                 where word == SearchTerm
                 select word;
int WordCount = MatchQuery.Count();
Length = SearchTerm.Length;*/
if (Word.Length > 1)
{
    if (!CheckedWords.ContainsKey(Word))
        CheckedWords.Add(Word,1);
    else
        CheckedWords[Word]++;
}
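Putting the dictionary idea together, a single-pass version might look roughly like the sketch below. It is an illustration under assumptions, not a drop-in replacement: the class name `StreamingWordCount` and the helper `CountWords` are hypothetical, and it reads the file line by line with `File.ReadLines` rather than with `ReadToEnd`, so words are also separated at line breaks, which differs slightly from splitting the whole text on spaces.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class StreamingWordCount
{
    // Hypothetical helper: counts every word longer than one character
    // in a single pass over the supplied lines.
    public static Dictionary<string, int> CountWords(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        char[] separators = { ' ' };

        foreach (string line in lines)
        {
            foreach (string word in line.Split(separators, StringSplitOptions.RemoveEmptyEntries))
            {
                if (word.Length <= 1)
                    continue; // same filter as the Length > 1 check in the question

                int current;
                counts.TryGetValue(word, out current); // current stays 0 if the word is new
                counts[word] = current + 1;
            }
        }
        return counts;
    }

    static void Main(string[] args)
    {
        // Path taken from the question unless one is passed on the command line.
        string path = args.Length > 0 ? args[0] : @"D:\Test\testing.txt";

        // File.ReadLines streams the file lazily, so it is read only once
        // and never held in memory as one big string.
        Dictionary<string, int> counts = CountWords(File.ReadLines(path));

        int id = 0;
        foreach (var pair in counts)
        {
            if (pair.Value <= 1)
                continue; // the question only keeps words that occur more than once

            id++;
            int byteWorth = pair.Key.Length * 8 * pair.Value; // same formula as the question
            Console.WriteLine("{0}\t{1}\t{2}\t{3}", id, pair.Key, pair.Value, byteWorth);
        }
    }
}
```

`TryGetValue` does one lookup where `ContainsKey` followed by the indexer does two, which can matter on a file with many millions of words. The rows printed in `Main` could just as well be added to a DataTable at that point.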
Upvotes: 0