goe

Reputation: 5415

List only duplicate lines based on one column from a semi-colon delimited file?

I have a file with a bunch of lines. Each one of these lines has 8 semi-colon delimited columns.

How can I (in Linux) return duplicate lines but only based on column number 2? Should I be using grep or something else?

Upvotes: 15

Views: 33923

Answers (9)

Konstantin Glukhov

Reputation: 2186

A one-liner with no sorting (note that the arrays-of-arrays syntax requires GNU awk 4.0 or later):

awk -F';' '{d[$2][a[$2]++]=$0} END{for (i in a) {if (a[i] > 1) for (j in d[i]) {print d[i][j]}}}'

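For example, run against the data.txt sample shown in Hai Vu's answer below, this should print the four duplicate rows; the exact output order is not guaranteed, because awk's for-in iteration order is unspecified:

$ awk -F';' '{d[$2][a[$2]++]=$0} END{for (i in a) {if (a[i] > 1) for (j in d[i]) {print d[i][j]}}}' data.txt
John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416
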
Upvotes: 2

αғsнιη

Reputation: 2761

A simple awk-only approach that removes the rows whose column #2 value is unique (in other words, it returns only the rows whose column #2 value is duplicated). You may need to change $2 to your target column, or to a combination of columns such as $X$Y; an example of a multi-column key follows the one-liner below.

awk -F\; 'NR==FNR{s[$2]++;next} (s[$2]>1)' infile infile

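For instance, a hypothetical variant keyed on the combination of columns 2 and 3 (joining them with FS so that adjacent values cannot run together) might look like this:

awk -F\; 'NR==FNR{s[$2 FS $3]++;next} (s[$2 FS $3]>1)' infile infile
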
Upvotes: 1

Hai Vu

Reputation: 40733

See my comments in the awk script below:

$ cat data.txt 
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

$ cat dup.awk 
BEGIN { FS = ";" }

{
    # Keep count of the fields in second column
    count[$2]++;

    # Save the line the first time we encounter a unique field
    if (count[$2] == 1)
        first[$2] = $0;

    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];

    # From the second time onward, always print because the field is
    # duplicated
    if (count[$2] > 1)
        print
}

Example output:

$ sort -t ';' -k 2 data.txt | awk -f dup.awk

Alex Tremble;atrem;415
Alex Trebe;atrem;416
John Thomas;jd;301
John Tomas;jd;302

Here is my solution #2:

awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt

The beauty of this solution is that it preserves the line order, at the expense of chaining several tools together (awk, sort, uniq, and fgrep).

The awk command prints out the second field of every line, and that output is then sorted. Next, the uniq -d command picks out the duplicated strings. At this point, the standard output contains a list of duplicated second fields, one per line. We then pipe that list into fgrep. The '-f -' flag tells fgrep to read the search strings from standard input.

Yes, you can go all out with the command line. I like the second solution better for exercising many tools and for its clearer logic (at least to me). The drawbacks are the number of tools and possibly the memory used. Also, the second solution is inefficient because it scans the data file twice: the first time with the awk command and the second time with the fgrep command. This consideration matters only when the input file is large.

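For the data.txt sample above, the pipeline should print the duplicate rows in their original order, i.e. something like:

$ awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt
John Thomas;jd;301
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
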
Upvotes: 22

Jonathan Leffler

Reputation: 753990

As @mjv surmised - awk (or Perl, or Python) is a better choice:

awk -F';' ' {
    if (assoc[$2]) {          # This field 2 has been seen before
        if (assoc[$2] != 1) { # The first occurrence has not been printed
            print assoc[$2];  # Print first line with given $2
            assoc[$2] = 1;    # Reset array entry so we know we've printed it;
                              # a full line has 8 fields with semi-colons and
                              # cannot be confused with 1.
        }
        print $0;             # Print this duplicate entry
    }
    else {
        assoc[$2] = $0;       # Record line in associative array, indexed by
                              # second field.  
    }
}' <<!
a;b;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;2;c;d;e;f;g;h
a;z;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;x;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
!

This works, but it can slightly reorder the data - because it prints the first occurrence of a duplicated line when the second occurrence appears. The sample output is:

a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;4;c;d;e;f;g;h

This variant of the awk script reorders the testing, leading to a slightly more compact notation. It also explicitly ignores malformed data lines that do not contain 8 fields separated by semi-colons. It is packaged as a shell script, but without any option handling so you can only supply a list of files to scan (it reads standard input if there are no files listed). I removed the Perl-ish semi-colons in the script; awk does not need them.

#!/bin/sh

awk -F';' '
NF == 8 {
    if (!assoc[$2]) assoc[$2] = $0
    else if (assoc[$2] != 1)
    {
        print assoc[$2]
        assoc[$2] = 1
        print $0
    }
    else print $0
}' "$@"

Also, @mjv commented that there could be memory problems with a solution such as this one if the input is huge, because it keeps a record of each distinct field 2 value in the associative array 'assoc'. We can eliminate that if the data fed into awk is sorted, something we can ensure with sort, of course. Here's a variant script that deals with monstrous inputs (because sort spills data to disk if necessary to hold intermediate results):

sort -t';' -k 2,2 "$@" |
awk -F';' '
BEGIN { last = ";"; line = "" }
NF == 8 {
    if ($2 != last)
    {
        last = $2
        line = $0
    }
    else if (line != "")
    {
        print line
        line = ""
        print $0
    }
    else print $0
}'

This only keeps a copy of one line of input. The output from the sample data is given in sorted order, of course.

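With the same sample data as above, this version should print the same eleven duplicate lines, grouped by field 2; the exact order of the groups depends on the collation sort uses, but it would typically be something like:

a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;1;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;4;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;c;c;d;e;f;g;h
a;q;c;d;e;f;g;h
a;q;c;d;e;f;g;h
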
Upvotes: 3

Jim Dennis

Reputation: 17500

I'm assuming that you're not relying on any particular ordering of the input (it might not have been pre-sorted on the key (second) field) and that you'd prefer to preserve the order of the input lines in your output, printing copies of the first and all subsequent lines that contain duplicate values in the second field.

Here's the quickest snippet of code I could come up with in Python:

    import fileinput
    seen = dict()
    for line in fileinput.input():
        fields = line.split(';')
        key = fields[1]
        if key in seen:
            if not seen[key][0]:
                print seen[key][1],
                seen[key] = (True, seen[key][1])
            print line,
        else:
            seen[key] = (False, line)

The fileinput module lets us handle our input lines in a manner similar to the default awk file/input processing ... or to the semantics of Perl's -n command line switch.

From there we simply keep track of the first line we see with a unique value in the second field, and a flag indicating whether we've printed this one before. When we first find a duplicate we print the first line that had that key, and mark it as having been printed, then we print the current line. For all subsequent duplicates we just print the current line. Obviously for any non-dupe we simply post it as an entry to our dictionary.

There's probably a more elegant way to handle that "first dupe" boolean, but this was the most obvious to me and shouldn't pose any undue additional overhead. Creating a very simple object/class with its own state ("I've been printed") would be an option, but I think that would make the overall gist of the code more difficult to understand.

It should be obvious that this can be done in any scripting or programming language with support for associative arrays (hashes, dictionaries, tables, or whatever your preferred language calls them). The only difference between this code and most of the other examples I've seen in this thread is in the assumptions I'm making about your requirements (that you'd prefer to preserve the relative order of the input and output lines).

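A hypothetical invocation (the script uses Python 2 print statements, and the filename is just an example), using the data.txt sample from Hai Vu's answer above:

$ python2 print_dups_by_col2.py data.txt
John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416
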
Upvotes: 0

Mr.Ree

Reputation: 8418

Borrowing from Hai Vu:

% cat data.txt
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

There's the really easy way (with GNU sort & gawk), though it will re-order the output and print only the second and later occurrences of each duplicated key. (Caveat: without --stable, the sort could reorder lines so that the second occurrence comes before the first. Watch out for that!)

cat data.txt | sort -k2,2 -t';' --stable | gawk -F';' '{if ( $2==old ) { print $0 }; old=$2; }'

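Run against the data.txt above, that should give just the later occurrence of each duplicated key, e.g.:

Alex Trebe;atrem;416
John Tomas;jd;302
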
There's also the Perl way (this one preserves the input order, but again prints only the second and later occurrences of each duplicated key)...

cat data.txt | perl -e 'while(<>) { @data = split(/;/); if ( defined( $test{$data[1]} ) ) { print $_; } $test{$data[1]} = $_; }'

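Against the same data.txt, that should print something like:

John Tomas;jd;302
Alex Trebe;atrem;416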

Upvotes: 3

mjv

Reputation: 75185

grep may do it, but I'm guessing you'll have a much easier time with awk (aka gawk, on some systems).

The right chain of commands or script for your needs depends on a few extra bits of info: for example, whether the input file is already sorted, and how big the input is (in particular, whether it is huge or a stream)...

Assuming sorted input (either originally or from piping through sort), the awk script would look something like this (warning: untested):

Check the solutions provided by Jonathan Leffler or Hai Vu for a way to achieve the same without the pre-sort requirement.

#!/usr/bin/awk -f
# *** Simple AWK script to output duplicate lines found in input ***
#    Assumes input is sorted on the second field

BEGIN {
    FS = ";";   #delimiter
    dupCtr = 0;       # number of duplicate _instances_
    dupLinesCtr = 0;  # total number of duplicate lines

    firstInSeries = 1;   #used to detect if this is first in series

    prevLine = "";
    prevCol2 = "";  # use another string in case empty field is valid
}

{
  if ($2 == prevCol2) {
    if (firstInSeries == 1) {
      firstInSeries = 0;
      dupCtr++;
      dupLinesCtr++;
      print prevLine
    }
    dupLinesCtr++;
    print $0
  }
  else
     firstInSeries = 1
  prevCol2 = $2
  prevLine = $0
}

END { #optional display of counts etc.
  print "*********"
  print "Total duplicate instances = " iHits "   Total lines = " NR;
}

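A hypothetical invocation, assuming the script above is saved as dups.awk (the filename is just an example), would be:

sort -t';' -k2,2 input.txt | awk -f dups.awk
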
Upvotes: 1

sud03r

Reputation: 19759

how about:

 sort -t ';' -k 2 test.txt | awk -F';' 'BEGIN{curr="";prev="";prevline="";flag=0} \
                     NF==8{ prev=curr;
                            curr=$2;
                            if(prev!=curr){flag=1}
                            else{flag++};
                            if(flag==2)print prevline;
                            if(flag>=2)print $0;
                            prevline=$0 }'

I also tried the uniq command, which has an option ("-d") for displaying repeated lines, but I was unable to figure out whether it can be used with fields.

Upvotes: 1

jtbandes

Reputation: 118691

Here's a somewhat convoluted awk script.

awk 'BEGIN { FS=";" } { c[$2]++; l[$2,c[$2]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }' file.txt

It works by keeping a counter of all occurrences of each value in the second field, and the lines which have that value, then it prints out the lines that have counters greater than 1.

Replace all instances of $2 with whichever field number you need, and the file.txt at the end with your filename.

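For the data.txt sample from Hai Vu's answer above, this should print the four duplicate rows, e.g. (the order of the groups depends on awk's unspecified for-in iteration order):

John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416
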
Upvotes: 9
