star2014

Reputation: 183

How to remove duplicate data for a group query in LINQ

I'm trying to get a distinct list of filenames for each bug id. I used LINQ to group all filenames by bug id, but I don't know how to remove the duplicate filenames within each group. The file output contains multiple rows like this: bugid filename1 filename2 filename3 filename4 ... There are multiple rows with the same bugid, and there are also duplicate filenames for each bug id. This is my code:

using System;
using System.Collections.Generic;
using System.Text;
using System.Linq;

namespace finalgroupquery
{
    class MainClass
    {
        public static void Main(string[] args)
        {
            List<bug> list2 = new List<bug>();
            using (System.IO.StreamReader reader1 = new System.IO.StreamReader(@"/home/output"))
            using (System.IO.StreamWriter file = new System.IO.StreamWriter(@"/home/output1"))
            {
                string line1;
                while ((line1 = reader1.ReadLine()) != null)
                {
                    // First tab-separated field is the bug id, the rest are filenames.
                    string[] items1 = line1.Split('\t');
                    bug bg = new bug();
                    bg.bugid = items1[0];
                    for (int i = 1; i <= items1.Length - 1; i++)
                    {
                        bg.list1.Add(items1[i]);
                    }
                    list2.Add(bg);
                }

                var bugquery = from c in list2
                               group c by c.bugid into x
                               select new Container { BugID = x.Key, Grouped = x };

                foreach (Container con in bugquery)
                {
                    StringBuilder files = new StringBuilder();
                    files.Append(con.BugID);
                    files.Append("\t");

                    foreach (var x in con.Grouped)
                    {
                        files.Append(string.Join("\t", x.list1.ToArray()));
                    }

                    file.WriteLine(files.ToString());
                }
            }
        }
    }

    public class Container
    {
        public string BugID { get; set; }
        public IGrouping<string, bug> Grouped { get; set; }
    }

    public class bug
    {
        public List<string> list1 { get; set; }
        public string bugid { get; set; }

        public bug()
        {
            list1 = new List<string>();
        }
    }
}

Upvotes: 1

Views: 1599

Answers (2)

AirL

Reputation: 1907

Try this code:

        var bugquery = from c in list2
                        group c by c.bugid into x
                        select new bug { bugid = x.Key, list1 = x.SelectMany(l => l.list1).Distinct().ToList() };

        foreach (bug bug in bugquery)
        {
            StringBuilder files = new StringBuilder();
            files.Append(bug.bugid);
            files.Append("\t");
            files.Append(string.Join("\t", bug.list1.ToArray()));

            file.WriteLine(files.ToString());
        }

By combining the SelectMany and Distinct LINQ operators, you can flatten the filename lists of a group and remove duplicates in a single line.

SelectMany (from MSDN):

Projects each element of a sequence to an IEnumerable<T> and flattens the resulting sequences into one sequence.

Distinct (from MSDN):

Returns distinct elements from a sequence.

It also means that your Container class is no longer needed, as there's no need to iterate through the IGrouping<string, bug> collection anymore (here list1 contains all the filenames related to the bug, without duplicates).
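To see the grouping, flattening and de-duplication on their own, here is a minimal self-contained sketch. The class name DistinctPerGroupDemo, the Merge helper and the sample rows are made up for illustration and are not part of the original code; they only assume that each row is a bug id plus an array of filenames.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical demo class, not part of the question's code.
public static class DistinctPerGroupDemo
{
    // Groups rows by bug id, flattens the filename arrays of each group
    // with SelectMany, and removes repeated filenames with Distinct.
    public static Dictionary<string, List<string>> Merge(
        IEnumerable<KeyValuePair<string, string[]>> rows)
    {
        return rows
            .GroupBy(r => r.Key)
            .ToDictionary(
                g => g.Key,
                g => g.SelectMany(r => r.Value).Distinct().ToList());
    }

    public static void Main()
    {
        // Made-up sample data: bug 101 appears on two rows with overlapping files.
        var rows = new[]
        {
            new KeyValuePair<string, string[]>("101", new[] { "a.cs", "b.cs" }),
            new KeyValuePair<string, string[]>("102", new[] { "c.cs" }),
            new KeyValuePair<string, string[]>("101", new[] { "b.cs", "d.cs", "a.cs" }),
        };

        // One line per bug id, each filename listed once.
        foreach (var entry in Merge(rows))
        {
            Console.WriteLine(entry.Key + "\t" + string.Join("\t", entry.Value));
        }
    }
}
```

With this sample, bug 101 ends up with three filenames (a.cs, b.cs, d.cs) and bug 102 with one. The documentation describes the result of Distinct as unordered, so add an OrderBy if the output order matters.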

Edit

Since you may end up with blank lines and/or empty strings after reading and parsing your file, you could use this code to get rid of them:

        using (System.IO.StreamReader reader1 = new System.IO.StreamReader(@"/home/sunshine40270/mine/projects/interaction2/fasil-data/common history/outputpure"))
        {
            string line1;
            while ((line1 = reader1.ReadLine()) != null)
            {
                if (!string.IsNullOrWhiteSpace(line1))
                {
                    string[] items1 = line1.Split(new [] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
                    bug bg = new bug();
                    bg.bugid = items1[0];
                    for (int i = 1; i <= items1.Length - 1; i++)
                    {
                        bg.list1.Add(items1[i]);
                    }
                    list2.Add(bg);
                }
            }
        }

You'll notice:

  • Each line read into line1 is checked for emptiness as soon as it is retrieved from the stream (with !string.IsNullOrWhiteSpace(line1)).
  • To omit empty substrings from the return value of the string.Split method, you can use the StringSplitOptions.RemoveEmptyEntries parameter.
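The effect of the two checks above can be seen in a small sketch. The SplitDemo class, the Fields helper and the sample line are hypothetical and only serve as illustration.

```csharp
using System;

// Hypothetical demo class, not part of the question's code.
public static class SplitDemo
{
    // Splits a tab-separated line and drops empty fields.
    public static string[] Fields(string line)
    {
        return line.Split(new[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        // Made-up line with a doubled tab and a trailing tab.
        string line = "101\t\ta.cs\tb.cs\t";

        Console.WriteLine(line.Split('\t').Length);            // 5 (two empty fields included)
        Console.WriteLine(Fields(line).Length);                // 3 (empty fields removed)
        Console.WriteLine(string.IsNullOrWhiteSpace(" \t "));  // True (such a line would be skipped)
    }
}
```

Note that RemoveEmptyEntries only drops fields that are exactly empty. A field containing just a space would still be kept.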

Hope this helps.

Upvotes: 1

Dweeberly

Reputation: 4777

From your description it sounds like you want to do this:

        List<bug> bugs = new List<bug>();
        var lines = System.IO.File.ReadLines(@"/home/bugs");
        foreach (var line in lines)
        {
            string[] items = line.Split('\t');
            bug bg = new bug();
            bg.bugid = items[0];
            bg.list1 = items.Skip(1).OrderBy(f => f).Distinct().ToList();
            bugs.Add(bg);
        }

This will produce a list of objects, where each object has a unique list of filenames.
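The loop above removes duplicates within each line, but lines that share a bugid still produce separate objects. If those lines should be merged as well, one possible approach (a different technique from the code above, sketched under the assumption that sorted output is acceptable) is to accumulate the filenames in a dictionary of sets. The MergeByBugId class, its Merge method and the sample lines are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical demo class, not part of the answer's code.
public static class MergeByBugId
{
    // Accumulates filenames per bug id. SortedSet ignores filenames that
    // are already present and keeps the rest in sorted (ordinal) order.
    public static SortedDictionary<string, SortedSet<string>> Merge(IEnumerable<string> lines)
    {
        var byId = new SortedDictionary<string, SortedSet<string>>(StringComparer.Ordinal);
        foreach (string line in lines)
        {
            string[] items = line.Split(new[] { '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (items.Length == 0)
            {
                continue; // blank line, nothing to record
            }

            SortedSet<string> files;
            if (!byId.TryGetValue(items[0], out files))
            {
                files = new SortedSet<string>(StringComparer.Ordinal);
                byId[items[0]] = files;
            }

            for (int i = 1; i < items.Length; i++)
            {
                files.Add(items[i]); // no effect if the filename is already in the set
            }
        }
        return byId;
    }

    public static void Main()
    {
        // Made-up input: bug 101 appears on two lines with an overlapping file.
        var lines = new[] { "101\tb.cs\ta.cs", "102\tc.cs", "101\ta.cs\td.cs" };
        foreach (var entry in Merge(lines))
        {
            Console.WriteLine(entry.Key + "\t" + string.Join("\t", entry.Value));
        }
    }
}
```

For the sample input this writes one line for 101 with a.cs, b.cs and d.cs, followed by one line for 102 with c.cs. In real use the lines would come from System.IO.File.ReadLines, as in the code above.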

Upvotes: 1
