Reputation: 1
I converted an Excel file into a CSV file. The file contains over 100k records. I want to search the full name column and return duplicate rows: if two full names match, the program should return the entire rows of the duplicates. I started with code that returns a list of full names, but that's about it.
I've listed the code that I have now below:
public static void readCells()
{
    var dictionary = new Dictionary<string, int>();
    Console.WriteLine("started");
    var counter = 1;
    var readText = File.ReadAllLines(path);
    var duplicatedValues = dictionary.GroupBy(fullName => fullName.Value).Where(fullName => fullName.Count() > 1);
    foreach (var s in readText)
    {
        var values = s.Split(new Char[] { ',' });
        var fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            dictionary.Add(fullName, 1);
        }
        else
        {
            dictionary[fullName] += 1;
        }
        Console.WriteLine("Full Name Is: " + values[3]);
        counter++;
    }
}
Upvotes: 0
Views: 1656
Reputation: 6332
You can try out Cinchoo ETL - an open-source library to parse CSV files and identify the duplicates with a few lines of code.
Sample CSV file (EmpDuplicates.csv):
Id,Name
1,Tom
2,Mark
3,Lou
3,Lou
4,Austin
4,Austin
4,Austin
Here is how you can parse the file and identify the duplicate records:
using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).Select(g => g.FirstOrDefault()))
        Console.WriteLine(c.DumpAsJson());
}
Output:
{
  "Id": 3,
  "Name": "Lou"
}
{
  "Id": 4,
  "Name": "Austin"
}
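Note that the query above returns just one representative record per duplicate group. If you want the entire rows of all duplicates, as the question asks, you can flatten each group instead; a sketch reusing only the calls already shown above:
using (var parser = new ChoCSVReader("EmpDuplicates.csv").WithFirstLineHeader())
{
    // SelectMany flattens each duplicate group, so every duplicated row is printed,
    // not just the first one per group.
    foreach (dynamic c in parser.GroupBy(r => r.Id).Where(g => g.Count() > 1).SelectMany(g => g))
        Console.WriteLine(c.DumpAsJson());
}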
Hope this helps.
For more detailed usage of this library, see the CodeProject article at https://www.codeproject.com/Articles/1145337/Cinchoo-ETL-CSV-Reader
Upvotes: 1
Reputation: 116991
I've found that using Microsoft's built-in TextFieldParser (which you can use in C# despite it living in the Microsoft.VisualBasic.FileIO namespace) can simplify reading and parsing of CSV files. Using this type, your method readCells() can be modified into the following extension method:
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserExtensions
{
    public static List<IGrouping<string, string[]>> ReadCellsWithDuplicatedCellValues(string path, int keyCellIndex, int nRowsToSkip = 0)
    {
        using (var stream = File.OpenRead(path))
        using (var parser = new TextFieldParser(stream))
        {
            parser.SetDelimiters(new string[] { "," });
            var values = parser.ReadAllFields()
                // If your CSV file contains header row(s) you can skip them by passing a value for nRowsToSkip
                .Skip(nRowsToSkip)
                .GroupBy(row => row.ElementAtOrDefault(keyCellIndex))
                .Where(g => g.Count() > 1)
                .ToList();
            return values;
        }
    }

    public static IEnumerable<string[]> ReadAllFields(this TextFieldParser parser)
    {
        if (parser == null)
            throw new ArgumentNullException(nameof(parser));
        while (!parser.EndOfData)
            yield return parser.ReadFields();
    }
}
Which you would call like:
var groups = TextFieldParserExtensions.ReadCellsWithDuplicatedCellValues(path, 3);
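Each element of groups is an IGrouping keyed by the duplicated cell value (here the full name at column index 3), with the complete rows as its items, so printing the entire duplicate rows is a minimal loop, e.g.:
foreach (var group in groups)
{
    Console.WriteLine("Duplicated full name: " + group.Key);
    // Each item in the group is one complete CSV row, as a string[] of fields.
    foreach (var row in group)
        Console.WriteLine(string.Join(",", row));
}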
Notes:
TextFieldParser correctly handles cells with escaped, embedded commas, which s.Split(new Char[] { ',' }) will not.
Since your CSV file has over 100k records, I adopted a streaming strategy to avoid allocating the intermediate string[] readText array in memory.
Upvotes: 1
Reputation: 34421
I changed the dictionary to use the full name as the key and to store the full rows as values:
public static void readCells()
{
    var dictionary = new Dictionary<string, List<List<string>>>();
    Console.WriteLine("started");
    var counter = 1;
    var readText = File.ReadAllLines(path);
    foreach (var s in readText)
    {
        List<string> values = s.Split(new Char[] { ',' }).ToList();
        string fullName = values[3];
        if (!dictionary.ContainsKey(fullName))
        {
            // First time this full name appears: start a new list of rows for it.
            List<List<string>> newList = new List<List<string>>();
            newList.Add(values);
            dictionary.Add(fullName, newList);
        }
        else
        {
            // Name seen before: this row is a duplicate, so add it to the existing list.
            dictionary[fullName].Add(values);
        }
        Console.WriteLine("Full Name Is: " + values[3]);
        counter++;
    }
}
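With rows grouped under each full name, the duplicates are simply the dictionary entries holding more than one row; a minimal sketch (placed, for example, at the end of readCells, and assuming System.Linq is imported) to print them:
foreach (var entry in dictionary.Where(kvp => kvp.Value.Count > 1))
{
    Console.WriteLine("Duplicate full name: " + entry.Key);
    // entry.Value holds every original row that shares this full name.
    foreach (var row in entry.Value)
        Console.WriteLine(string.Join(",", row));
}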
Upvotes: 1