esgtrdhtyjg

Reputation: 23

Counting duplicate lines from text files using C#

Hey guys, I'm working on a program that takes information from a text file and outputs it to a CSV file. One thing I need to do is implement a count of the duplicate lines. The requirement reads: "Where possible, duplicate records of an offense charged against an individual should be aggregated into a single record with an additional field called "counts" that indicates the number of duplicate records found (for non-duplicate records, this field should be set to zero)." I've been having a little trouble adding the counter and was wondering if you had any advice for me.

Thank you

using System;
using System.IO;
using System.Linq;
using System.Collections.Generic;
using System.Text;

namespace finalproj
{
    class Program
    {
        static void Main(string[] args)
        {
            StreamReader reader = new StreamReader("DISTRICT.DISTRICT_COURT_.11.13.18.AM.000B.CAL.txt");

            StreamWriter writer = new StreamWriter("outtext.csv");

            int counts;
            string line = "";

            for (int x = 0; x < 1; x++)
            {
                string buffer = reader.ReadLine();
                line += " " + buffer;
            }

            //StreamWriter writer = new StreamWriter("outtext.csv");
            //writer.WriteLine(line);
            //writer.Close();

            //Console.WriteLine(line);

            while (line != null)
            {
                if (line.Contains("APT."))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("BPD"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("18IF"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("SHP"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("SFF"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("CLS:"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("BOND"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("ATTY"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("(T)"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("(M)"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("(F)"))
                {
                    Console.WriteLine(line);
                }
                else if (line.Contains("(I)"))
                {
                    Console.WriteLine(line);
                }


                line = reader.ReadLine();
                writer.WriteLine(line);
            }


            writer.WriteLine(line);

            reader.Close();
            writer.Close();
            Console.WriteLine(line);


            //using (reader)
            //{
            //    
            //string line1;
            //string[] split = new
            //    while((line1 = reader.ReadLine()) !=null)
            //    {
            //        string[] split = 
            //    }
            //}

            Console.ReadKey();
        }
    }
}

Upvotes: 1

Views: 997

Answers (2)

Aldert

Reputation: 4313

Here you go. I used a Regex to match the keywords you are looking for and a SortedSet to collect the lines and detect duplicates. Be aware that with big files this can use quite a lot of memory, but since this is CSV-related, I think you are fine:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

namespace ConsoleApp4
{
    class Program
    {

        static void Main(string[] args)
        {
            StreamReader reader = new StreamReader("DISTRICT.DISTRICT_COURT_.11.13.18.AM.000B.CAL.txt");

            StreamWriter writer = new StreamWriter("outtext.csv");

            int counts = 0;
            string line ;

            SortedSet<string> uniqueLine = new SortedSet<string>();

            Regex findWords = new Regex(@"(APT\.|BPD|18IF|SHP|SFF|CLS:|BOND|ATTY|\(T\)|\(M\)|\(F\)|\(I\))");

            while ((line = reader.ReadLine()) != null)
            {
                if (uniqueLine.Contains(line))
                {
                    counts++;
                }
                else
                {
                    uniqueLine.Add(line);
                    writer.WriteLine(line);
                }
                Match aMatch = findWords.Match(line);

                if (aMatch.Success)
                {
                    Console.WriteLine(line);
                }

            }

            writer.WriteLine("Count:{0}", counts);
            reader.Close();
            writer.Close();


            Console.ReadKey();
        }
    }
}
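
The code above keeps one running total for the whole file. If the goal is a "counts" value per record, as the question's requirement suggests, one possible variation is to track a counter per distinct line. The following is only a sketch under that assumption: the names DuplicateCounter and CountDuplicates are hypothetical, and the sample input in Main stands in for the real court file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class DuplicateCounter
{
    // Returns each distinct line in first-seen order, paired with the
    // number of extra copies found (0 when the line occurs only once).
    public static List<KeyValuePair<string, int>> CountDuplicates(TextReader reader)
    {
        var order = new List<string>();                // remembers first-seen order
        var counts = new Dictionary<string, int>();    // line -> extra copies seen
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (counts.TryGetValue(line, out int seen))
            {
                counts[line] = seen + 1;               // another duplicate of a known line
            }
            else
            {
                counts[line] = 0;                      // first occurrence, no duplicates yet
                order.Add(line);
            }
        }

        var result = new List<KeyValuePair<string, int>>();
        foreach (string key in order)
        {
            result.Add(new KeyValuePair<string, int>(key, counts[key]));
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        // Hypothetical sample input; a StreamReader over the real file works the same way.
        var sample = new StringReader("A\nB\nA\nA\n");
        foreach (var pair in DuplicateCounter.CountDuplicates(sample))
        {
            Console.WriteLine("{0},{1}", pair.Key, pair.Value);
        }
        // Prints:
        // A,2
        // B,0
    }
}
```

Because the method accepts a TextReader, it can be passed the StreamReader from the answer unchanged. Memory use still grows with the number of distinct lines, as with the SortedSet.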

Upvotes: 0

Anu Viswan

Reputation: 18155

To split the text into lines and count occurrences, you can call Split with Environment.NewLine and group the lines with LINQ (here str is assumed to hold the file's contents):

string[] lines = str.Split(new[] { Environment.NewLine },StringSplitOptions.None);
var result = lines.GroupBy(g => g)
            .Select(s => new { Key = s.Key, Count = s.Count()})
            .ToDictionary(d => d.Key, d => d.Count);

The result also includes lines that occur only once. If you want only the duplicated lines, filter the groups first:

var result = lines.GroupBy(g => g).Where(x=> x.Count()>1)
            .Select(s => new { Key = s.Key, Count = s.Count()})
            .ToDictionary(d => d.Key, d => d.Count);

You can then write the CSV directly from the dictionary

File.WriteAllLines(filePath, result.Select(x=>$"{x.Key},{x.Value},"));
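
Two details may need adjusting for the question's requirement. First, Count() is the total number of occurrences, whereas the "counts" field is described as zero for non-duplicate records, which suggests Count() - 1. Second, a line that itself contains a comma would break the CSV row unless it is quoted. A minimal sketch of both adjustments follows; the helper names CsvCounts, Escape, and ToRows are hypothetical, the sample records are made up, and reading "counts" as "extra copies" is an assumption about the requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CsvCounts
{
    // Quote a field following the common CSV convention: wrap it in double
    // quotes when it contains a comma, quote, or line break, and double any
    // embedded quotes.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0)
        {
            return field;
        }
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // One row per distinct line: the record, then its "counts" value,
    // taken here as the number of extra copies (0 for a unique record).
    public static List<string> ToRows(IEnumerable<string> lines)
    {
        return lines.GroupBy(l => l)
                    .Select(g => Escape(g.Key) + "," + (g.Count() - 1))
                    .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Made-up sample records for illustration only.
        var lines = new[] { "SMITH, JOHN", "DOE JANE", "SMITH, JOHN" };
        foreach (string row in CsvCounts.ToRows(lines))
        {
            Console.WriteLine(row);
        }
        // Prints:
        // "SMITH, JOHN",1
        // DOE JANE,0
    }
}
```

The rows returned by ToRows can be passed to File.WriteAllLines in the same way as the dictionary projection above.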

Upvotes: 1
