Reputation: 29
I want to find this pattern: column 2 is 'C' on the current line and column 2 on the next line is 'G', with column 4 being 'CG' on both lines. I want to compare the 1st line to the 2nd, the 3rd to the 4th, the 5th to the 6th, and so on, then print the current line and the next line as a pair. The 'C' can appear on both even and odd lines.
Input like this:
chr1 C 10467 CHH CT 0.0 0 1
chr1 C 10469 CG CG 0.0 0 1
chr1 G 10470 CG CG 0.0 0 8
chr1 C 10471 CG CG 0.0 0 1
chr1 G 10472 CG CG 1.0 8 8
The expected output, tab-delimited, is:
chr1 C 10469 CG CG 0.0 0 1
chr1 G 10470 CG CG 0.0 0 8
chr1 C 10471 CG CG 0.0 0 1
chr1 G 10472 CG CG 1.0 8 8
My code is:
awk '{a=$2; c=$4; d=$0; e=NR; getline; f=$2; g=$4} {if (a == "C" && f == "G" && c == "CG" && g == "CG") {print d,e,"\n",$0,NR}}' input_file
I use getline to check whether there is a 'G' on the next line. The problem is that getline consumes the next line, so awk then moves straight on to the third line and misses some pairs. For example, suppose column 2 of the input is:
Line 1: G
Line 2: C
Line 3: G
Line 4: C
The expected output is Line 2 and Line 3. However, awk went straight from the first line to the third instead of going line by line, so nothing is printed.
Kind regards!
Upvotes: 2
Views: 935
Reputation: 133518
EDIT (to compare each line with the line after it, use this one): adding this solution now, based on OP's new samples.
awk '
FNR>1{
if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
print prevLine ORS $0
}
}
{
secCol=$2
fourthCol=$4
prevLine=$0
}
' Input_file
Explanation: Adding detailed explanation for above.
awk '
##Starting awk program from here.
FNR>1{
##Checking condition if current line number is more than 1 then do following.
if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
##Checking condition if secCol is C AND 2nd column is G AND fourthCol is CG and 4th column is CG then do following.
print prevLine ORS $0
##Printing prevLine ORS and current line.
}
}
{
secCol=$2
##Creating secCol with 2nd column of current line.
fourthCol=$4
##Creating fourthCol with 4th column of current line.
prevLine=$0
##Setting prevLine to current line value.
}
' Input_file ##Mentioning Input_file name here.
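As a rough check of why the getline attempt in the question prints nothing while the previous-line approach does, the sketch below runs both on the five sample rows from the question. The function names getline_pairs and sliding_pairs are made up for this illustration, and the getline one-liner is a trimmed version of the question's code (without the NR output), not a verbatim copy.

```shell
# Sample rows copied from the question (whitespace-separated here).
sample='chr1 C 10467 CHH CT 0.0 0 1
chr1 C 10469 CG CG 0.0 0 1
chr1 G 10470 CG CG 0.0 0 8
chr1 C 10471 CG CG 0.0 0 1
chr1 G 10472 CG CG 1.0 8 8'

# Hypothetical helper: getline reads the next record inside the same cycle,
# so only the pairs (1,2), (3,4), ... are ever compared.
getline_pairs() {
  awk '{a=$2; c=$4; d=$0; getline
        if (a=="C" && $2=="G" && c=="CG" && $4=="CG") print d ORS $0}'
}

# Hypothetical helper: remember the previous record instead, so every
# adjacent pair (1,2), (2,3), (3,4), ... is compared.
sliding_pairs() {
  awk 'FNR>1 && secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG" {
         print prevLine ORS $0
       }
       {secCol=$2; fourthCol=$4; prevLine=$0}'
}

echo "getline version:"
printf '%s\n' "$sample" | getline_pairs
echo "previous-line version:"
printf '%s\n' "$sample" | sliding_pairs
```

On this input the matching pairs start on lines 2 and 4, which is exactly the alignment the getline version never tests.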
Initial solution (this compares each odd line with the even line after it; OP's samples became clearer after the edit, but I am keeping this solution here for future readers in case it helps): Could you please try the following, written as per the shown samples only. This also checks that the 4th column of the previous line (fourthCol) is CG; in case you don't need that, remove && fourthCol=="CG" from the following.
awk '
FNR%2==0{
if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
print prevLine ORS $0
}
prevLine=secCol=fourthCol=""
next
}
{
secCol=$2
fourthCol=$4
prevLine=$0
}
' Input_file
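Output will be as follows when each C/G pair starts on an odd line. (With the five-line input now shown in the question, the pairs start on even lines, so this odd/even version prints nothing there and the solution above is the one to use.)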
chr1 C 10469 CG CG 0.0 0 1
chr1 G 10470 CG CG 0.0 0 8
chr1 C 10471 CG CG 0.0 0 1
chr1 G 10472 CG CG 1.0 8 8
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR%2==0{ ##Checking if the current line number is even (divisible by 2).
if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
##Checking condition if secCol is C AND 2nd column is G AND fourthCol is CG and 4th column is CG then do following.
print prevLine ORS $0 ##Printing prevLine ORS and current line.
}
prevLine=secCol=fourthCol="" ##Nullifying prevLine, secCol, fourthCol here.
next ##next will skip all further statements from here.
}
{
secCol=$2 ##Creating secCol with 2nd column of current line.
fourthCol=$4 ##Creating fourthCol with 4th column of current line.
prevLine=$0 ##Setting prevLine to current line value.
}
' Input_file ##Mentioning Input_file name here.
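To make the difference between the odd/even pairing and a line-by-line comparison concrete, here is a small sketch using the G, C, G, C column-2 sequence described in the question. Only that column-2 sequence comes from the question; the positions 100 to 103, the other column values, and the function names rows, odd_even_pairs and sliding_pairs are invented for illustration.

```shell
# Hypothetical four-row input whose column 2 reads G, C, G, C.
rows() {
  printf '%s\n' \
    'chr1 G 100 CG CG 0.0 0 1' \
    'chr1 C 101 CG CG 0.0 0 1' \
    'chr1 G 102 CG CG 0.0 0 1' \
    'chr1 C 103 CG CG 0.0 0 1'
}

# Odd/even pairing: only (1,2) and (3,4) are compared.
odd_even_pairs() {
  awk 'FNR%2==0 {
         if (secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG") print prevLine ORS $0
         prevLine=secCol=fourthCol=""
         next
       }
       {secCol=$2; fourthCol=$4; prevLine=$0}'
}

# Line-by-line comparison: (1,2), (2,3) and (3,4) are all compared.
sliding_pairs() {
  awk 'FNR>1 && secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG" {
         print prevLine ORS $0
       }
       {secCol=$2; fourthCol=$4; prevLine=$0}'
}

echo "odd/even pairing:"
rows | odd_even_pairs
echo "line-by-line comparison:"
rows | sliding_pairs
```

The odd/even version finds nothing here because the only C-then-G pair sits on lines 2 and 3, while the line-by-line version prints those two rows.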
Upvotes: 3
Reputation: 37404
Man, I understood that completely wrong first. I hope I got it right this time.
$ awk '
$2=="G" && $4=="CG" && p2=="C" && p4=="CG" {
print p ORS $0
}
{
p=$0
p2=$2
p4=$4
}' file
Output:
chr1 C 10469 CG CG 0.0 0 1
chr1 G 10470 CG CG 0.0 0 8
chr1 C 10471 CG CG 0.0 0 1
chr1 G 10472 CG CG 1.0 8 8
Explained:
awk '
$2=="G" && # column 2 of the current line is G
$4=="CG" && # and column 4 of the current line is CG
p2=="C" && # column 2 of the previous line is C
p4=="CG" { # and column 4 of the previous line is CG
print p ORS $0 # then print the previous line and the current line as a pair
}
{
p=$0 # current record is previous on next round
p2=$2 # same goes for column 2
p4=$4 # and column 4
}' file
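Since the question asks for tab-delimited output, it may help to confirm that printing p and $0 leaves the records untouched, so tabs in the input survive in the output. The sketch below assumes a tab-separated input built from two of the question's sample rows; tab_rows and match_pairs are hypothetical names used only here.

```shell
# Two tab-separated rows using values from the question's sample.
tab_rows() {
  printf 'chr1\tC\t10469\tCG\tCG\t0.0\t0\t1\n'
  printf 'chr1\tG\t10470\tCG\tCG\t0.0\t0\t8\n'
}

# Same logic as the answer above. On the first record p2 and p4 are still
# unset, compare as empty strings, and so cannot match "C" or "CG".
match_pairs() {
  awk '$2=="G" && $4=="CG" && p2=="C" && p4=="CG" {print p ORS $0}
       {p=$0; p2=$2; p4=$4}'
}

# The default field splitting handles tabs as well as spaces, and because
# no field is ever assigned, $0 is printed exactly as it was read.
tab_rows | match_pairs
```

If the fields themselves could contain spaces, passing -F'\t' to awk would be the safer choice; that is not needed for the sample shown.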
Upvotes: 2