Baas

Reputation: 46

C# console app: reading a text file and counting words

OK, so I have a C# console app that is supposed to read through a .txt file and count the distinct words, and it works. BUT it reads through the whole file once for every distinct word in the file, so with a 100 MB file it runs for days. What I would like is a way to read through the file once and count all the distinct words. Here is SOME of the app so far:

using System;
using System.IO;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.Diagnostics;
using System.Data;
using System.IO.MemoryMappedFiles;

namespace CompressionApp
{
    class Program
    {
        static void Main(string[] args)
        {
            //read all text
            string FilePath = (@"D:\Test\testing.txt");
            string FullText;
            using (StreamReader streamReader = new StreamReader(FilePath))
            {
                FullText = streamReader.ReadToEnd();
            }
            FileInfo Info = new FileInfo(FilePath);
            int FileSize = Convert.ToInt32(Info.Length);
//some code

            string[] Words = FullText.Split(' ');

            var DistinctWords = new List<string>(Words.Distinct());

//some code

            int P = 0;
            int ID = 0;
            int Length = 0;
            int ByteWorth;
            double Perc;
            double PPerc = 0;
            bool display = false;

            using (var mappedFile1 = MemoryMappedFile.CreateFromFile(FilePath))
            {
                using (Stream mmStream = mappedFile1.CreateViewStream())
                {
                    using (StreamReader sr = new StreamReader(mmStream, ASCIIEncoding.ASCII))
                    {
                        Parallel.ForEach(DistinctWords, new ParallelOptions { MaxDegreeOfParallelism = 1 }, Word =>
                        {
                            DataRow dr = dt.NewRow();
                            string SearchTerm = Word;
                            var MatchQuery = from word in Words
                                             where word == SearchTerm
                                             select word;

                            int WordCount = MatchQuery.Count();
                            Length = SearchTerm.Length;
                            if (Length > 1)
                            {
                                if (WordCount > 1)
                                {
                                    ID = ID + 1;
                                    ByteWorth = (Length * 8) * WordCount;
                                    dr["Word"] = SearchTerm;
                                    dr["Count"] = WordCount;
                                    dr["ID"] = ID;
                                    dr["Length"] = Length;
                                    dr["ByteWorth"] = ByteWorth;
                                    dt.Rows.Add(dr);
                                }
                            }
//some code below

This is the complete app so far. It's not very tidy, I know, but I am new to coding.

Any tips, hints or suggestions are welcome.

Upvotes: 1

Views: 1135

Answers (2)

Jim Mischel

Reputation: 134125

So as I understand it, you're getting the distinct words and then for each word you're going through the entire file to count occurrences of that word. My bet is that finding the distinct words takes very little time, but the loop that counts occurrences is taking approximately forever.

You can get the distinct words and their counts with LINQ. Replace this line of code:

var DistinctWords = new List<string>(Words.Distinct());

with

var DistinctWithCount = from word in Words
                        group word by word
                        into g
                        select new {Word = g.Key, Count = g.Count()};

You can then enumerate the words with counts like this:

foreach (var g in DistinctWithCount)
{
    Console.WriteLine("{0},{1}", g.Word, g.Count);
}
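
For reference, here is a minimal self-contained sketch of the same grouping idea. The names WordGrouping, CountWords and GroupingDemo are made up for illustration, and splitting on any whitespace (rather than a single space, as in the question) is an assumption. Sorting by count is an optional extra.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WordGrouping
{
    // Hypothetical helper: returns (word, count) pairs, most frequent first.
    public static List<KeyValuePair<string, int>> CountWords(string text)
    {
        // A null separator array means "split on any whitespace".
        // Dropping empty entries is an assumption; the question splits on ' ' only.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

        return words
            .GroupBy(w => w)                                    // one group per distinct word
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)                    // optional: highest count first
            .ToList();
    }
}

class GroupingDemo
{
    static void Main()
    {
        foreach (var pair in WordGrouping.CountWords("the cat and the hat and the bat"))
        {
            Console.WriteLine("{0},{1}", pair.Key, pair.Value);
        }
    }
}
```

The grouping walks the word array once, so the cost grows with the number of words rather than with words times distinct words.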

Upvotes: 2

Yogee

Reputation: 1462

I cannot write the whole logic for you, but here are some pointers. I am using a dictionary instead of a DataTable; you can build the table later from the dictionary. If you also want an ID, use a complex value type instead of 'int'. The int value currently holds the count for that word.

var CheckedWords = new Dictionary<string, int>();

Below is how my code inside the foreach loop looks:

                        /*DataRow dr = dt.NewRow();
                        string SearchTerm = Word;
                        var MatchQuery = from word in Words
                                         where word == SearchTerm
                                         select word;

                        int WordCount = MatchQuery.Count();

                        Length = SearchTerm.Length;*/

                        if (Word.Length > 1)
                        {
                            if (!CheckedWords.ContainsKey(Word))
                                CheckedWords.Add(Word,1);
                            else
                                CheckedWords[Word]++;
                        }
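
A fuller sketch of the dictionary approach might look like the following. Note that for the counts to be right, the loop has to run over every word (Words), not over DistinctWords, since each distinct word would only be seen once. The names WordStats, SinglePassCounter and SinglePassDemo are hypothetical, and reading with File.ReadLines is a swapped-in choice so the file is streamed line by line instead of loaded whole. Unlike the question's code, this assigns an ID when a word is first seen, not only to words that occur more than once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical "complex value type": holds an ID alongside the running count.
class WordStats
{
    public int Id;
    public int Count;
}

static class SinglePassCounter
{
    // Counts every word longer than one character in a single pass over the input.
    public static Dictionary<string, WordStats> Count(IEnumerable<string> lines)
    {
        var stats = new Dictionary<string, WordStats>();
        foreach (string line in lines)
        {
            // Splitting on any whitespace is an assumption; the question splits on ' '.
            foreach (string word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                if (word.Length <= 1) continue;

                if (stats.TryGetValue(word, out WordStats s))
                    s.Count++;                       // seen before: bump the count
                else
                    stats[word] = new WordStats { Id = stats.Count + 1, Count = 1 };
            }
        }
        return stats;
    }
}

class SinglePassDemo
{
    static void Main(string[] args)
    {
        // Path taken from the question; pass another path as the first argument.
        string path = args.Length > 0 ? args[0] : @"D:\Test\testing.txt";

        // File.ReadLines enumerates the file lazily, one line at a time.
        foreach (var pair in SinglePassCounter.Count(File.ReadLines(path)))
        {
            WordStats s = pair.Value;
            if (s.Count > 1)
            {
                int byteWorth = pair.Key.Length * 8 * s.Count;   // same formula as the question
                Console.WriteLine("{0},{1},{2},{3}", s.Id, pair.Key, s.Count, byteWorth);
            }
        }
    }
}
```

TryGetValue does one lookup per word where ContainsKey followed by the indexer does two, which adds up over a 100 MB file. The DataTable from the question can be filled from the dictionary once counting is done.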

Upvotes: 0
