Reputation: 1
I converted an Excel file into a CSV file. The file contains over 100k records. I want to search the full name column and return duplicate rows: if two full names match, the program should return the entire rows of the duplicates. I started with code that returns a list of full names, but that's about it.
I've listed the code that I have now below:
public static void readCells()
{
    var dictionary = new Dictionary<string, int>();
    Console.WriteLine("started");
    var counter = 1;
    var readText = File.ReadAllLines(path);
    var duplicatedValues = dictionary.GroupBy(fullName => fullName.Value).Where(fullName => fullName.Count() > 1);
    foreach (var s in readText)
    {
        var values = s.Split(new Char[] { ',' });
        var fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            dictionary.Add(fullName, 1);
        }
        else
        {
            dictionary[fullName] += 1;
        }
        Console.WriteLine("Full Name Is: " + values[3]);
        counter++;
    }
}
Upvotes: 0
Views: 1656
Reputation: 6332
You can try out Cinchoo ETL - an open-source library to parse CSV files and identify the duplicates with a few lines of code.
Sample CSV file (EmpDuplicates.csv):
Id,Name
1,Tom
2,Mark
3,Lou
3,Lou
4,Austin
4,Austin
4,Austin
Here is how you can parse the file and identify the duplicate records:
using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).Select(g => g.FirstOrDefault()))
        Console.WriteLine(c.DumpAsJson());
}
Output:
{
  "Id": 3,
  "Name": "Lou"
}
{
  "Id": 4,
  "Name": "Austin"
}
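Note that the query above returns just one representative record per duplicate group. If you want the entire rows of all duplicates, as the question asks, you can flatten each group instead; a sketch reusing only the calls already shown above:
using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    // SelectMany flattens each duplicate group, so every duplicated row is printed,
    // not just the first one per group.
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).SelectMany(g => g))
        Console.WriteLine(c.DumpAsJson());
}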
Hope this helps.
For more detailed usage of this library, see the CodeProject article at https://www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
Upvotes: 1
Reputation: 116991
I've found that using Microsoft's built-in TextFieldParser (which you can use in C# despite it living in the Microsoft.VisualBasic.FileIO namespace) can simplify reading and parsing of CSV files. Using this type, your method readCells() can be modified into the following extension method:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserExtensions
{
    public static List<IGrouping<string, string[]>> ReadCellsWithDuplicatedCellValues(string path, int keyCellIndex, int nRowsToSkip = 0)
    {
        using (var stream = File.OpenRead(path))
        using (var parser = new TextFieldParser(stream))
        {
            parser.SetDelimiters(new string[] { "," });
            var values = parser.ReadAllFields()
                // If your CSV file contains header row(s) you can skip them by passing a value for nRowsToSkip
                .Skip(nRowsToSkip)
                .GroupBy(row => row.ElementAtOrDefault(keyCellIndex))
                .Where(g => g.Count() > 1)
                .ToList();
            return values;
        }
    }

    public static IEnumerable<string[]> ReadAllFields(this TextFieldParser parser)
    {
        if (parser == null)
            throw new ArgumentNullException(nameof(parser));
        while (!parser.EndOfData)
            yield return parser.ReadFields();
    }
}
Which you would call like:
var groups = TextFieldParserExtensions.ReadCellsWithDuplicatedCellValues(path, 3);
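Each element of groups is an IGrouping keyed by the duplicated cell value (here the full name at column index 3), with the complete rows as its items, so printing the entire duplicate rows is a minimal loop, e.g.:
foreach (var group in groups)
{
    Console.WriteLine("Duplicated full name: " + group.Key);
    // Each item in the group is one complete CSV row, as a string[] of fields.
    foreach (var row in group)
        Console.WriteLine(string.Join(",", row));
}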
Notes:
TextFieldParser correctly handles cells with escaped, embedded commas, which s.Split(new Char[] { ',' }) will not.
Since your CSV file has over 100k records, I adopted a streaming strategy to avoid allocating the intermediate string[] readText array in memory.
Upvotes: 1
Reputation: 34421
I changed the dictionary to use the full name as the key and to store the full rows as values:
public static void readCells()
{
    var dictionary = new Dictionary<string, List<List<string>>>();
    Console.WriteLine("started");
    var counter = 1;
    var readText = File.ReadAllLines(path);
    foreach (var s in readText)
    {
        List<string> values = s.Split(new Char[] { ',' }).ToList();
        string fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            // First time this full name appears: start a new list of rows for it.
            List<List<string>> newList = new List<List<string>>();
            newList.Add(values);
            dictionary.Add(fullName, newList);
        }
        else
        {
            // Name seen before: this row is a duplicate, so add it to the existing list.
            dictionary[fullName].Add(values);
        }
        Console.WriteLine("Full Name Is: " + values[3]);
        counter++;
    }
}
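With rows grouped under each full name, the duplicates are simply the dictionary entries holding more than one row; a minimal sketch (placed, for example, at the end of readCells, and assuming System.Linq is imported) to print them:
foreach (var entry in dictionary.Where(kvp => kvp.Value.Count > 1))
{
    Console.WriteLine("Duplicate full name: " + entry.Key);
    // entry.Value holds every original row that shares this full name.
    foreach (var row in entry.Value)
        Console.WriteLine(string.Join(",", row));
}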
Upvotes: 1