bobia9193

Reputation: 29

Compare current line and next line in awk

I want to find pairs of consecutive lines where column 2 is 'C' in the current line and 'G' in the next line, and column 4 is 'CG' in both lines. I want to compare the 1st line to the 2nd, the 3rd to the 4th, the 5th to the 6th, and so on, then print each matching pair (the current line and the next line). The 'C' can appear on either an even or an odd line.

The input looks like this:

chr1    C   10467   CHH CT  0.0 0   1
chr1    C   10469   CG  CG  0.0 0   1
chr1    G   10470   CG  CG  0.0 0   8
chr1    C   10471   CG  CG  0.0 0   1
chr1    G   10472   CG  CG  1.0 8   8

The expected output, tab-delimited, is:

chr1    C   10469   CG  CG  0.0 0   1
chr1    G   10470   CG  CG  0.0 0   8
chr1    C   10471   CG  CG  0.0 0   1
chr1    G   10472   CG  CG  1.0 8   8

My code is:

awk '{a=$2; c=$4; d=$0; e=NR; getline; f=$2; g=$4} {if (a == "C" && f == "G" && c == "CG" && g == "CG") {print d,e,"\n",$0,NR}}' input_file

I use getline to check whether there is a 'G' on the next line. The problem is that getline consumes the next line, so awk then continues from the third line and some pairs are never compared. For example, suppose column 2 of the input is:

Line 1: G
Line 2: C
Line 3: G
Line 4: C

The expected output is Line 2 and Line 3. However, awk jumped from the first line straight to the third line instead of advancing line by line, so nothing was printed.

Kind regards!

Upvotes: 2

Views: 935

Answers (2)

RavinderSingh13

Reputation: 133518

EDIT (to compare each line with its next line, use this one): Adding this solution now, to match OP's new samples.

awk '
FNR>1{
  if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
    print prevLine ORS $0
  }
}
{
  secCol=$2
  fourthCol=$4
  prevLine=$0
}
'  Input_file

Explanation: a detailed explanation of the above.

awk '
##Starting awk program from here.
FNR>1{
##Checking condition if current line number is more than 1 then do following.
  if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
##Checking condition if secCol is C AND 2nd column is G AND fourthCol is CG and 4th column is CG then do following. 
    print prevLine ORS $0
##Printing prevLine ORS and current line.
  }
}
{
  secCol=$2
##Creating secCol with 2nd column of current line.
  fourthCol=$4
##Creating fourthCol with 4th column of current line.
  prevLine=$0
##Setting prevLine to current line value.
}
'  Input_file ##Mentioning Input_file name here. 
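
As a quick check of why comparing every line with its predecessor matters, here is a minimal sketch that runs both approaches on the four-line example from the question (column 2 is G, C, G, C). The values in the other columns and the variable names sample, sliding and paired are made up for illustration; only the column 2 pattern comes from the question.

```shell
# Hypothetical 4-line sample: column 2 is G, C, G, C as in the question;
# the remaining columns are made-up values for illustration only.
sample=$(printf 'chr1\tG\t1\tCG\nchr1\tC\t2\tCG\nchr1\tG\t3\tCG\nchr1\tC\t4\tCG\n')

# Sliding comparison: every line is compared with the line before it.
sliding=$(printf '%s\n' "$sample" | awk '
FNR>1 && prev2=="C" && $2=="G" && prev4=="CG" && $4=="CG" { print prevLine ORS $0 }
{ prev2=$2; prev4=$4; prevLine=$0 }')

# getline-based pairing, as in the question: lines are consumed two at a time,
# so only the pairs (1,2) and (3,4) are ever compared.
paired=$(printf '%s\n' "$sample" | awk '
{ a=$2; c=$4; d=$0; getline
  if (a=="C" && $2=="G" && c=="CG" && $4=="CG") print d ORS $0 }')

printf 'sliding:\n%s\n' "$sliding"
printf 'paired:\n%s\n' "$paired"
```

With this sample the sliding version should report lines 2 and 3, while the getline version reports nothing, because the C/G pair straddles two of its fixed pairs.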


Initial solution (this compares each odd line with the even line that follows it): OP's samples became clearer after the edit, but I am keeping this solution here too for future readers in case it helps. Could you please try the following, written as per the shown samples only. It also checks that the previous line's 4th column (fourthCol) is CG; in case you don't need that, remove && fourthCol=="CG" from the following.

awk '
FNR%2==0{
  if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
    print prevLine ORS $0
  }
  prevLine=secCol=fourthCol=""
  next
}
{
  secCol=$2
  fourthCol=$4
  prevLine=$0
}
'  Input_file

Output will be as follows.

chr1    C   10469   CG  CG  0.0 0   1
chr1    G   10470   CG  CG  0.0 0   8
chr1    C   10471   CG  CG  0.0 0   1
chr1    G   10472   CG  CG  1.0 8   8

Explanation: a detailed explanation of the above.

awk '                          ##Starting awk program from here.
FNR%2==0{                      ##Checking if the line number is even (divisible by 2).
  if(secCol=="C" && $2=="G" && fourthCol=="CG" && $4=="CG"){
##Checking condition if secCol is C AND 2nd column is G AND fourthCol is CG and 4th column is CG then do following.
    print prevLine ORS $0      ##Printing prevLine ORS and current line.
  }
  prevLine=secCol=fourthCol="" ##Nullifying prevLine, secCol, fourthCol here.
  next                         ##next will skip all further statements from here.
}
{
  secCol=$2                    ##Creating secCol with 2nd column of current line.
  fourthCol=$4                 ##Creating fourthCol with 4th column of current line.
  prevLine=$0                  ##Setting prevLine to current line value.
}
'  Input_file                  ##Mentioning Input_file name here. 

Upvotes: 3

James Brown

Reputation: 37404

Man, I understood that completely wrong first. I hope I got it right this time.

$ awk '
$2=="G" && $4=="CG" && p2=="C" && p4=="CG" {
    print p ORS $0
}
{
    p=$0
    p2=$2
    p4=$4
}' file

Output:

chr1    C   10469   CG  CG  0.0 0   1
chr1    G   10470   CG  CG  0.0 0   8
chr1    C   10471   CG  CG  0.0 0   1
chr1    G   10472   CG  CG  1.0 8   8

Explained:

awk '
$2=="G" &&            # column 2 in the current line is G
$4=="CG" &&           # and column 4 in the current line is CG
p2=="C" &&            # column 2 in the previous line is C
p4=="CG" {            # and column 4 in the previous line is CG
    print p ORS $0    # then print the previous and current lines as a pair
}
}
{
    p=$0              # current record is previous on next round
    p2=$2             # same goes for column 2
    p4=$4             # and column 4
}' file
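
The question asks for tab-delimited output, while the program above prints each record exactly as it was read. If the input is separated by spaces rather than tabs, one possible tweak (a sketch, not something the question confirms is needed) is to set OFS to a tab and rebuild each record with $1=$1 before storing or printing it. The variable names input and result are only for this demo; the rows are the sample rows from the question.

```shell
# Sample rows from the question, here separated by runs of spaces.
input='chr1    C   10467   CHH CT  0.0 0   1
chr1    C   10469   CG  CG  0.0 0   1
chr1    G   10470   CG  CG  0.0 0   8
chr1    C   10471   CG  CG  0.0 0   1
chr1    G   10472   CG  CG  1.0 8   8'

result=$(printf '%s\n' "$input" | awk -v OFS='\t' '
{ $1=$1 }   # assigning a field rebuilds $0 with fields joined by OFS (a tab)
$2=="G" && $4=="CG" && p2=="C" && p4=="CG" {
    print p ORS $0      # previous line, then current line
}
{ p=$0; p2=$2; p4=$4 }  # remember this line for the next round
')

printf '%s\n' "$result"
```

If the input already uses tabs, the $1=$1 step changes nothing visible and can be dropped.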

Upvotes: 2
