Reputation: 57
In Python pandas, I can easily drop duplicates in a DataFrame with:
df1.drop_duplicates(['Service Date', 'Customer Number'], inplace=True)
Is there anything in C# or Deedle that's this simple and fast? Or do I need to iterate over the entire frame (from a large CSV file) to drop duplicates?
The data I'm working with is imported from a large CSV file with about 40 columns and 12k rows. For each date, there are multiple entries per Customer Number. I need to drop the duplicate rows so that exactly one row remains per Customer Number per date.
Here's some simplified data, using DATE and RECN as the columns used to de-dupify:
NAME, TYPE, DATE, RECN, COMM
Kermit, Frog, 06/30/14, 1, 1test
Kermit, Frog, 06/30/14, 1, 2test
Ms. Piggy, Pig, 07/01/14, 2, 1test
Fozzy, Bear, 06/29/14, 3, 1test
Kermit, Frog, 07/02/14, 1, 3test
Kermit, Frog, 07/02/14, 1, 4test
Kermit, Frog, 07/02/14, 1, 5test
Ms. Piggy, Pig, 07/02/14, 2, 3test
Fozzy, Bear, 07/02/14, 3, 2test
Ms. Piggy, Pig, 07/02/14, 2, 2test
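For reference, here is a minimal sketch of what the pandas call does to the sample data above. It assumes the sample is parsed with skipinitialspace=True (so the column names carry no leading space) and uses DATE and RECN in place of the real 'Service Date' and 'Customer Number' columns. By default drop_duplicates keeps the first row of each group.

```python
import io

import pandas as pd

# Sample rows copied from the question
SAMPLE = """NAME, TYPE, DATE, RECN, COMM
Kermit, Frog, 06/30/14, 1, 1test
Kermit, Frog, 06/30/14, 1, 2test
Ms. Piggy, Pig, 07/01/14, 2, 1test
Fozzy, Bear, 06/29/14, 3, 1test
Kermit, Frog, 07/02/14, 1, 3test
Kermit, Frog, 07/02/14, 1, 4test
Kermit, Frog, 07/02/14, 1, 5test
Ms. Piggy, Pig, 07/02/14, 2, 3test
Fozzy, Bear, 07/02/14, 3, 2test
Ms. Piggy, Pig, 07/02/14, 2, 2test
"""


def drop_sample_duplicates():
    # skipinitialspace=True strips the blank after each comma
    # (an assumption about how the file is laid out)
    df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True)
    # Keep the first row for every (DATE, RECN) pair; later repeats are dropped
    return df.drop_duplicates(subset=["DATE", "RECN"])


deduped = drop_sample_duplicates()
print(deduped)
```

Six of the ten rows survive: one per customer per date, with the first COMM value seen for each pair.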
Upvotes: 0
Views: 2559
Reputation: 6316
Deedle doesn't seem to have that sort of utility in its CSV reader functions. Using another CSV reader to load the data (LumenWorks CSV Reader), I was able to de-duplicate it with these extension methods:
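// NuGet: PM> Install-Package LumenWorksCsvReader
using System;
using System.Collections.Generic;
using System.IO;
using Deedle;
using LumenWorks.Framework.IO.Csv;

public static class DeduplicateCsv
{
    // Lazily reads the file and yields one series per CSV row, keyed by column header.
    public static IEnumerable<Series<string, object>> ReadCsv(this string file)
    {
        using (var csv = new CsvReader(new StreamReader(file), true))
        {
            int fieldCount = csv.FieldCount;
            string[] headers = csv.GetFieldHeaders();
            while (csv.ReadNextRecord())
            {
                var seriesBuilder = new SeriesBuilder<string>();
                for (int i = 0; i < fieldCount; i++)
                {
                    seriesBuilder.Add(headers[i], csv[i]);
                }
                yield return seriesBuilder.Series;
            }
        }
    }

    // Yields only the first element seen for each key.
    // Written as an iterator so the HashSet is created per enumeration; with a set
    // captured by a lazy Where, a second enumeration would yield nothing.
    public static IEnumerable<TSource> DistinctObject<TSource, TCompare>(this IEnumerable<TSource> source, Func<TSource, TCompare> compare)
    {
        var set = new HashSet<TCompare>();
        foreach (var element in source)
        {
            if (set.Add(compare(element)))
            {
                yield return element;
            }
        }
    }

    // Keeps the first row for each distinct value in the given column.
    public static IEnumerable<Series<string, object>> DeDupify(this IEnumerable<Series<string, object>> source, string key)
    {
        return source.DistinctObject(s => s.Get(key));
    }
}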
Here is how I used it:
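// One row per (Service Date, Customer Number) pair. Chaining two single-column
// DeDupify calls would instead keep one row per date, then one per customer.
var frame = Frame.FromRows("data.csv"
    .ReadCsv()
    .DistinctObject(s => Tuple.Create(s.Get("Service Date"), s.Get("Customer Number")))
    .ToList()
);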
frame.Print();
Note that I had to put a .ToList() at the end, since Deedle seems to enumerate the IEnumerable more than once.
Upvotes: 1