LP_640

Reputation: 579

Add a line and characters after the last match(es)

I have a file in which I need to append a line (containing a couple of characters) after the last line of each group, where a group is defined by the 3-digit number at the start of the line. The data are grouped by gene number (122, 239, ...); there are many genes, and each gene may have a variable number of individuals.

cat test

122_mex1 TGCAGGC
122_mex2 TGCAGTC
122_mex3 TGCAGTC
122_can4 TGCATTT
239_mex1 TGCAAAA
239_mex2 TGCAAAA
239_can4 TGCAGCA
...
567_can4 TGCAAAT

The output should look like this:

cat output

122_mex1 TGCAGGC
122_mex2 TGCAGTC
122_mex3 TGCAGTC
122_can4 TGCATTT
//|1
239_mex1 TGCAAAA
239_mex2 TGCAAAA
239_can4 TGCAGCA
//|2

etc.

How can I find the last line of each gene number (the number that starts each line) and append a line after it containing some characters and a counter that increments (1, 2, 3, etc.)?

I have found a way to append a line after a given match (e.g. 122):

awk '/122/{seen++} seen && !/122/{print "//|1"; seen=0} 1' test

but I'd like to do this for all gene numbers (122, 239, 455, 234, etc.), looping over the genes and appending a line "//|i" after each one, where i is the running count.

Any thoughts on how to start this?

Thanks!

Upvotes: 2

Views: 133

Answers (2)

anubhava

Reputation: 784968

You can use awk:

awk -F_ 'p==""{p=$1; print; next} p != $1 {p=$1; print "//|" ++i} 1; END{print "//|" ++i}' test
122_mex1 TGCAGGC
122_mex2 TGCAGTC
122_mex3 TGCAGTC
122_can4 TGCATTT
//|1
239_mex1 TGCAAAA
239_mex2 TGCAAAA
239_can4 TGCAGCA
//|2

Explanation:

-F_                       # set the field separator to _
p==""{p=$1; print; next}  # first line: p is not set yet, so set p=$1, print the line and move to the next line
p != $1                   # if the 1st field differs from the previous value of the 1st field
{p=$1; print "//|" ++i}   # set p=$1 and print a divider line with an incrementing counter
1;                        # default action: print each record
END{print "//|" ++i}      # END block: print the divider line after the last group
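For anyone who wants to check the logic on their own data, here is a sketch of the same idea written as a small shell function with a multi-line awk program. The names `add_dividers`, `sample.txt`, `prev` and `n` are chosen purely for illustration and are not part of the answer above. It tracks the previous gene in the same way, and additionally guards the END block so that empty input produces no divider at all.

```shell
# Small sample input: two genes only (the file name is an assumption)
cat > sample.txt <<'EOF'
122_mex1 TGCAGGC
122_mex2 TGCAGTC
239_mex1 TGCAAAA
EOF

# Hypothetical helper: print the file given as $1 with a numbered
# divider line after each run of lines sharing the same gene prefix
add_dividers() {
  awk -F_ '
    NR > 1 && $1 != prev { print "//|" (++n) }  # gene changed: close the previous group
    { prev = $1; print }                        # remember the gene and echo the line
    END { if (NR > 0) print "//|" (++n) }       # close the last group, unless input was empty
  ' "$1"
}

add_dividers sample.txt
```

The parentheses around `++n` are only there to make the concatenation unambiguous to read; the counter behaves the same as in the one-liner.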

Upvotes: 1

Chris Seymour

Reputation: 85775

This will do the trick:

$ awk -F_ 'NR>1 && last!=$1{print "//|"++i}{last=$1}1' test
122_mex1 TGCAGGC
122_mex2 TGCAGTC
122_mex3 TGCAGTC
122_can4 TGCATTT
//|1
239_mex1 TGCAAAA
239_mex2 TGCAAAA
239_can4 TGCAGCA
//|2
...
//|3
567_can4 TGCAAAT

To save the results use shell redirection:

$ awk -F_ 'NR>1 && last!=$1{print "//|"++i}{last=$1}1' test > output
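One caveat with redirection: sending the output to the same file that awk is reading (for example `> test`) truncates the input before awk gets to read it. If the aim is to update the file in place, a common pattern is to write to a temporary file and rename it only if awk succeeded. A minimal sketch, assuming a hypothetical input file `genes.txt` and temporary name `genes.txt.tmp`:

```shell
# Sample input (hypothetical file name, two genes only)
printf '%s\n' '122_mex1 TGCAGGC' '122_can4 TGCATTT' '239_mex1 TGCAAAA' > genes.txt

# Write to a temporary file first; replace the original only if awk succeeded
awk -F_ 'NR>1 && last!=$1{print "//|" (++i)}{last=$1}1' genes.txt > genes.txt.tmp &&
  mv genes.txt.tmp genes.txt

cat genes.txt
```

As with the one-liner above, a divider is printed only when the gene changes, so no divider follows the final group in this sketch.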

Upvotes: 3
