Susheel Busi

Reputation: 163

Partial matches in 2 columns following exact match

I need to do an exact match followed by a partial match and retrieve the strings from two columns. I would ideally like to do this with awk.

Input:

k141_18046_1    k141_18046_1
k141_18046_1    k141_18046_2
k141_18046_2    k141_18046_1
k141_12033_1    k141_18046_2
k141_12033_1    k141_12033_1
k141_12033_2    k141_12033_2
k141_2012_1     k141_2012_1
k141_2012_1     k141_2012_2
k141_2012_2     k141_2012_1
k141_21_1     k141_2012_2
k141_21_1       k141_21_1
k141_21_2       k141_21_2

Expected output:

k141_18046_1    k141_18046_2
k141_18046_2    k141_18046_1
k141_2012_1     k141_2012_2
k141_2012_2     k141_2012_1

In the rows I want, both columns share the same first part of the ID (everything before the last underscore) and differ only in the trailing number. That is, I need the rows where one column holds ID_1 and the other holds ID_2, in either order.

Thank you, Susheel

Upvotes: 0

Views: 176

Answers (1)

James Brown

Reputation: 37394

Updated based on comments:

$ awk '
$1!=$2 {                     # consider only unequal strings
    n=split($1,a,/_/)        # split both strings on underscores
    m=split($2,b,/_/)
    if(m==n) {               # both must have the same number of parts
        for(i=1;i<n;i++)
            if(a[i]!=b[i])   # all parts but the last must be equal
                next         # otherwise the row is not valid
    } else
        next
    print                    # if you made it this far, print the row
}' file

Output:

k141_18046_1    k141_18046_2
k141_18046_2    k141_18046_1
k141_2012_1     k141_2012_2
k141_2012_2     k141_2012_1
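
To see why the loop above stops one short of n, it may help to look at what split() returns for a single ID. This is only a small sketch; show_parts is a hypothetical shell wrapper I made up for the illustration, not part of the answer's script:

```shell
# show_parts is a hypothetical helper name, used only for this illustration
show_parts() {
    printf '%s\n' "$1" | awk '{
        n = split($1, a, /_/)        # n is the number of pieces between underscores
        print n, a[1], a[2], a[n]    # a[n] is the trailing number the loop leaves out
    }'
}

show_parts k141_18046_1
```

For k141_18046_1 this prints 3 k141 18046 1, so comparing a[1] through a[n-1] compares everything except the trailing number.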

Another awk, using match():

$ awk '
substr($1,match($1,/^.*_/),RLENGTH) == substr($2,match($2,/^.*_/),RLENGTH) && 
substr($1,match($1,/[^_]*$/),RLENGTH) != substr($2,match($2,/[^_]*$/),RLENGTH)
' file
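
The one-liner above reads RLENGTH in the same substr() call that runs match(), so it relies on the arguments being evaluated left to right, which as far as I know awk does not guarantee. A possible variant, assuming the prefix is everything up to and including the last underscore, does the match first and reads RLENGTH afterwards. The names filter_pairs, prefix and sample.txt are made up for this sketch:

```shell
# sample.txt is a hypothetical file holding a few rows from the question
printf '%s\n' \
  'k141_18046_1    k141_18046_1' \
  'k141_18046_1    k141_18046_2' \
  'k141_12033_1    k141_18046_2' \
  'k141_2012_2     k141_2012_1' \
  'k141_21_1       k141_2012_2' > sample.txt

# filter_pairs is a hypothetical wrapper so the awk program can be reused
filter_pairs() {
    awk '
    # prefix(): everything up to and including the last "_", or "" if there is none
    function prefix(s) {
        return match(s, /^.*_/) ? substr(s, 1, RLENGTH) : ""
    }
    # keep rows whose IDs differ but share a non-empty prefix
    $1 != $2 && prefix($1) != "" && prefix($1) == prefix($2)
    ' "$@"
}

filter_pairs sample.txt
```

The match() call is the condition of the ?: expression, so it has finished before substr() looks at RLENGTH. Since the prefixes are equal and the whole strings are not, the trailing parts must differ, which is why the second comparison from the one-liner is not repeated here.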

Upvotes: 1
