Paul
Paul

Reputation: 321

Check and count duplicates lines

I have a input data like:

chr17   41243232    41243373    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23
chr13   32954112    32954208    BRCA2_ex24

And I need to check for duplicates rows $2 and $3 row, if is duplicated, I need to merged into one line and $4 column print as comma separated.

Output:

chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

Is there any AWK solution to easy process this kind a data? I would appreciate explained solution. Input and output are tab-separated formats. NOTE: First, second and third fields are allays equal.

My try was:

awk -v OFS="\t" '{i=$2 FS $1 FS $3 FS $4} {a[i]=!a[i]?$4:a[i] "," $4} END {for (l in a) {print l,a[l]}}' infile

Thank you for any ideas.

Upvotes: 1

Views: 97

Answers (4)

Ed Morton
Ed Morton

Reputation: 203089

$ cat tst.awk
{
    curr = $2 FS $3

    if (curr == prev) {
        buf = buf "," $NF
    }
    else {
        if (NR>1) {
            print buf
        }
        buf = $0
    }

    prev = curr
}
END { print buf }

$ awk -f tst.awk file
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24

The differences between this and @JamesBrown's solution are:

  1. This only stores 1 output line in memory at a time while James' stores the whole file.
  2. This will print the lines in the order they appear in the input while James' will print them in "random" (hash) order.
  3. This relies on the input with the key ($2 & $3) values contiguous as you show in your sample while James' will work for input in any order.

Upvotes: 1

Sundeep
Sundeep

Reputation: 23667

if perl is okay:

$ cat ip.txt 
chr17   41243232    41243373    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr17   41243639    41243811    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23
chr13   32954112    32954208    BRCA2_ex24

$ perl -ale '$k = join "\t",@F[0..2]; $h{$k} .= $h{$k} ? ",$F[3]" : $F[3]; END{ print "$_\t$h{$_}" foreach (keys %h) }' ip.txt 
chr17   41243639    41243811    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24
  • -ale split input line on spaces and save to @F array, strip newlines from input line and add newline for print statements
  • $k = join "\t",@F[0..2] key to work with - first 3 elements joined by tab
  • $h{$k} .= $h{$k} ? ",$F[3]" : $F[3] append values to hash variable, add , depending on existing value is empty or not
  • END{ print "$_\t$h{$_}" foreach (keys %h) } after all lines are processed, print key and value separated by tab. Order of keys is random

Alternate way by extracting key, value using regex:

$ perl -nle '($k,$v)=/^(.*?)\s+(\S+)$/; $h{$k} .= $h{$k} ? ",$v" : $v; END{print "$_\t$h{$_}" foreach (keys %h) }' ip.txt 
chr13   32954112    32954208    BRCA2_ex23,BRCA2_ex24
chr17   41243639    41243811    BRCA1_ex11
chr17   41243232    41243373    BRCA1_ex11,BRCA1_ex12
chr17   41243471    41243644    BRCA1_ex11

Upvotes: 1

CiaPan
CiaPan

Reputation: 9571

Just replace the first assignment with

 i=$1 FS $2 FS $3

And possibly filter the output through sed to replace spaces with tabs:

...  | sed 's/ /    /g'

      space---^   ^--- TAB

Output:

chr13   32954112        32954208        BRCA2_ex23,BRCA2_ex24
chr17   41243639        41243811        BRCA1_ex11
chr17   41243232        41243373        BRCA1_ex11,BRCA1_ex12
chr17   41243471        41243644        BRCA1_ex11

Upvotes: 1

James Brown
James Brown

Reputation: 37394

$ cat script.awk
{
    a[$2 OFS $3] = $1                  # store $1, last instance
    b[$2 OFS $3] = b[$2 FS $3] $4 ","  # append the $4s
} 
END {
    for (i in a) {                     # order is awk default
        sub(/,$/, "", b[i])            # remove trailing ","
        print a[i], i, b[i]            # print
    }
}

Run:

$ awk -f script.awk infile
chr17 41243471 41243644 BRCA1_ex11
chr17 41243232 41243373 BRCA1_ex11,BRCA1_ex12
chr17 41243639 41243811 BRCA1_ex11
chr13 32954112 32954208 BRCA2_ex23,BRCA2_ex24

Upvotes: 2

Related Questions