Amaranta_Remedios

Reputation: 773

Compare columns in two files and if match change string in another column

I have two files

file1 
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase gene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase gene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821

What I want: if any line in file2 matches column 13 of file1 (a partial match, because the value in file1 is wrapped in double quotes and followed by a semicolon), I want to change the string in column 4 to "pseudogene"; otherwise the line should be left unchanged.
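
To make the "partial match" concrete: the last whitespace-separated field of file1 holds the symbol wrapped in double quotes with a trailing semicolon, so one way to compare it with file2 is to strip those characters first. A minimal sketch, assuming the symbol is always the last field (the variable name `sym` is only illustrative):

```shell
# One sample line from file1 (taken from the question).
line='non-coding  X   FlyBase gene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";'

# Take the last field and strip the double quotes and the semicolon.
# "sym" is an illustrative variable name, not something from the question.
sym=$(printf '%s\n' "$line" | awk '{ sym = $NF; gsub(/[";]/, "", sym); print sym }')

echo "$sym"
```

The stripped value can then be compared with a line of file2 for exact equality. The reverse also works: wrap each file2 line in quotes plus a semicolon and compare it with the untouched last field.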

Desired output

non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene  476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene  15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene  19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

So far I can get the matches, but I can't do the rest.

grep -Ff file2 file1

Upvotes: 3

Views: 111

Answers (2)

RavinderSingh13

Reputation: 133428

With your shown samples, please try the following awk code. It also preserves the whitespace present in file1.

awk '
BEGIN{ s1="\"" }
FNR==NR{
  arr[s1 $0 s1";"]
  next
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)
  firstPart=substr($0,RSTART,RLENGTH)
  $0=substr($0,RSTART+RLENGTH)
  match($0,/^[^[:space:]]+/)
  restPart=substr($0,RSTART+RLENGTH)
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart
}
' file2 file1

Explanation: the same code, annotated line by line.

awk '                                          ## Start of the awk program.
BEGIN{ s1="\"" }                               ## Set s1 to a double quote in the BEGIN section.
FNR==NR{                                       ## FNR==NR is true only while the first file (file2) is being read.
  arr[s1 $0 s1";"]                             ## Store the line, wrapped in quotes plus a semicolon, as an index of arr.
  next                                         ## Skip the remaining statements for this line.
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)  ## Match the first 3 fields together with their trailing whitespace.
  firstPart=substr($0,RSTART,RLENGTH)          ## Save the matched part in firstPart for later use.
  $0=substr($0,RSTART+RLENGTH)                 ## Reset the current line to everything after the matched part.
  match($0,/^[^[:space:]]+/)                   ## Match the 4th field: everything up to the next whitespace.
  restPart=substr($0,RSTART+RLENGTH)           ## Save everything after the 4th field in restPart.
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart ## Print firstPart, then pseudogene or the original 4th field, then restPart.
}
' file2 file1                                  ## Input files: file2 first, then file1.
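
The point about preserving whitespace can be checked in isolation. Assigning to a field such as `$4` makes awk rebuild the record with OFS (a single space by default), whereas splicing the record with match()/substr() leaves the original separators untouched. The sketch below shows the same idea in a simplified form: it walks over the first three fields in a loop instead of using the `{3}` interval, on the assumption that some older awk builds lack interval support. The variable names `naive`, `kept`, `head` and `rest` are illustrative.

```shell
# Sample line with runs of spaces between fields, as displayed in the question.
line='non-coding  X   FlyBase gene    476857  479309'

# Assigning to a field rebuilds $0 with single-space separators.
naive=$(printf '%s\n' "$line" | awk '{ $4 = "pseudogene"; print }')

# Splicing the text by position keeps the original separators.
kept=$(printf '%s\n' "$line" | awk '{
  rest = $0; head = ""
  for (i = 1; i <= 3; i++) {            # peel off fields 1-3 with their separators
    match(rest, /^[^ \t]+[ \t]+/)
    head = head substr(rest, 1, RLENGTH)
    rest = substr(rest, RLENGTH + 1)
  }
  sub(/^[^ \t]+/, "", rest)             # drop the old 4th field
  print head "pseudogene" rest
}')

printf '%s\n%s\n' "$naive" "$kept"
```

The first printed line has every separator collapsed to one space. The second keeps the original spacing around the replaced field.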

Upvotes: 3

Ed Morton

Reputation: 203189

Using GNU awk for the 3rd arg to match() and \s/\S shorthand:

$ cat tst.awk
NR==FNR {
    genes["\""$1"\";"]
    next
}
$NF in genes {
    match($0,/((\S+\s+){3})\S+(.*)/,a)
    $0 = a[1] "pseudogene" a[3]
}
{ print }

$ awk -f tst.awk file2 file1
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";

Alternatively, using any POSIX awk:

$ cat tst.awk
NR==FNR {
    genes["\""$1"\";"]
    next
}
$NF in genes {
    match($0,/([^[:space:]]+[[:space:]]+){3}/)
    tail = substr($0,RLENGTH+1)
    sub(/[^[:space:]]+/,"",tail)
    $0 = substr($0,1,RLENGTH) "pseudogene" tail
}
{ print }

$ awk -f tst.awk file2 file1
non-coding  X   FlyBase gene    20025099    20025170    .   +   .   gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding  X   FlyBase pseudogene    476857  479309  .   -   .   gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding  X   FlyBase pseudogene    15576355    15576964    .   +   .   gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding  X   FlyBase pseudogene    19910168    19910521    .   -   .   gene_id "FBgn0052821"; gene_symbol "CR32821";
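
The lookup step that both scripts share can be tried on its own. `NR==FNR` holds only while the first file is read, so that block builds the set of quoted symbols. The second block then tests the last field of each line of the other file. This is a sketch with shortened, made-up file1 lines (only the final `gene_symbol` part is kept); the temporary directory and the names `flags` and `tag` are illustrative.

```shell
tmp=$(mktemp -d)

# file2 as shown in the question.
printf '%s\n' betaNACtes5 CR18275 '28SrRNA-Psi:CR45859' CR32821 > "$tmp/file2"

# Shortened stand-ins for file1 lines: only the last field matters here.
printf '%s\n' \
  'x gene_symbol "tRNA:Pro-CGG-1-1";' \
  'x gene_symbol "CR32821";' \
  'x gene_symbol "CR18275";' > "$tmp/file1"

flags=$(awk '
NR == FNR { genes["\"" $1 "\";"]; next }   # first file: remember each quoted symbol
{
  tag = ($NF in genes) ? "match" : "no-match"   # second file: look up the last field
  print tag, $NF
}
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$flags"
rm -rf "$tmp"
```

Only the lines flagged `match` would have their 4th field rewritten by the scripts above.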

Upvotes: 2
