AishwaryaKulkarni

Reputation: 784

Extracting information from lines having columns occurring more than once

I have a file:

chr1 1234 2345 EG1234:E1
chr1 2350 2673 EG1234:E2
chr1 2673 2700 EG1234:E2
chr1 2700 2780 EG1234:E2
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6010 EG2345:E3
chr2 6010 6020 EG2345:E3

As you can see, there is a specific ID before the ':' and an ID after the ':' that may be shared by more than one row. I want output that looks something like this:

chr1 1234 2345 EG1234:E1 (output as it is, since its ID is not duplicated in the next row)
chr1 2350 2780 EG1234:E2 (for duplicates, take the 1st and 2nd columns of the first
occurrence and the 3rd and 4th columns of the last occurrence)

and similarly:

  chr2 5672 5700 EG2345:E1
  chr2 5705 5890 EG2345:E2
  chr2 6000 6020 EG2345:E3

I was trying to use a key to move to the next column, but I am not quite sure how I would extract the column-wise values:

 awk '{key=$4; if (!(key in data)) c[++n]=key; data[key]=$0} END{for (i=1; i<=n; i++) print data[c[i]]}' file1

In short, I want to extract the first two columns of the first occurrence and the last two columns of the last occurrence for any rows with a duplicate 4th column.
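For reference, that merge logic can be sketched in awk like this (a minimal sketch, assuming whitespace-separated columns; it preserves the order in which IDs first appear):

```shell
# Sketch: remember columns 1-2 from the first occurrence of each $4,
# and keep overwriting columns 3-4 with the latest occurrence.
awk '{
    key = $4
    if (!(key in first)) {      # first time this ID is seen
        order[++n] = key        # remember insertion order
        first[key] = $1 FS $2   # keep columns 1-2 of the first occurrence
    }
    last[key] = $3 FS $4        # columns 3-4 of the latest occurrence
}
END {
    for (i = 1; i <= n; i++)
        print first[order[i]], last[order[i]]
}' file1
```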

Upvotes: 0

Views: 52

Answers (2)

James Brown

Reputation: 37464

This one's only drawback is that it doesn't preserve the record order:

($1 FS $4 in a) {                            # combination of $1 and $4 is the key
    split(a[$1 FS $4],b)                     # split to get the old $2
    a[$1 FS $4]=b[1] FS b[2] FS $3 FS b[4]   # update $3
    next
}
{
    a[$1 FS $4]=$0                           # new key found
}
END {
    for(i in a)                              # print them all
        print a[i]
}

Test it:

$ awk -f foo.awk foo.txt
chr1 2350 2780 EG1234:E2
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6020 EG2345:E3
chr1 1234 2345 EG1234:E1

One-liner:

$ awk '($1 FS $4 in a) {split(a[$1 FS $4],b); a[$1 FS $4]=b[1] FS b[2] FS $3 FS b[4]; next} {a[$1 FS $4]=$0} END {for(i in a) print a[i]}' foo.txt
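If the shuffled order matters, a variant of the same idea (a sketch, not part of the original answer) records the first-seen order of each key so the END loop can print in that order:

```shell
# Same merge keyed on $1 FS $4, plus an order[] array so the
# END loop prints records in the order their keys first appeared.
awk '{
    k = $1 FS $4
    if (k in a) {               # duplicate key: keep old $1/$2/$4, take new $3
        split(a[k], b)
        a[k] = b[1] FS b[2] FS $3 FS b[4]
    } else {
        order[++n] = k          # remember first-seen order
        a[k] = $0
    }
}
END {
    for (i = 1; i <= n; i++)
        print a[order[i]]
}' foo.txt
```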

Upvotes: 2

Inian

Reputation: 85865

Using awk, treat the key1:key2 pair as a unique combination and use it to filter duplicates. Here $4 holds the key1:key2 from your file.

awk '!seen[$4]++' file

chr1 1234 2345 EG1234:E1
chr1 2350 2673 EG1234:E2
chr2 5672 5700 EG2345:E1
chr2 5705 5890 EG2345:E2
chr2 6000 6010 EG2345:E3

The logic is straightforward: the line identified by key1:key2 is printed only if it has not been seen already.
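Spelled out, the `!seen[$4]++` idiom is equivalent to this longer form: the post-increment returns the old count, so the negation is true only on the first encounter of each key.

```shell
# Equivalent long form of the !seen[$4]++ idiom
awk '{
    if (seen[$4] == 0)   # key not seen before
        print $0
    seen[$4]++           # count this occurrence
}' file
```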

Upvotes: 1
