Reputation: 773
I have two files
file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
file2
betaNACtes5
CR18275
28SrRNA-Psi:CR45859
CR32821
What I want: if any line in file2 matches column 13 of file1 (a partial match, because column 13 is wrapped in double quotes and ends with a semicolon), change the string in column 4 to "pseudogene"; otherwise leave the line unchanged.
Desired output
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
So far I can get the matches, but I can't do the rest.
grep -Ff file2 file1
Upvotes: 3
Views: 111
Reputation: 133428
With your shown samples, please try the following awk code. It also preserves the whitespace present in file1.
awk '
BEGIN{ s1="\"" }
FNR==NR{
  arr[s1 $0 s1";"]
  next
}
{
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)
  firstPart=substr($0,RSTART,RLENGTH)
  $0=substr($0,RSTART+RLENGTH)
  match($0,/^[^[:space:]]+/)
  restPart=substr($0,RSTART+RLENGTH)
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart
}
' file2 file1
Explanation: a detailed, line-by-line explanation of the code above.
awk '                                            ##Start the awk program.
BEGIN{ s1="\"" }                                 ##Set s1 to a double quote in the BEGIN section.
FNR==NR{                                         ##True only while the first file (file2) is being read.
  arr[s1 $0 s1";"]                               ##Create an arr entry keyed by the current line wrapped in double quotes plus a trailing semicolon, e.g. "CR32821";
  next                                           ##Skip the remaining statements for this line.
}
{                                                ##Runs for every line of file1.
  match($0,/^([^[:space:]]+[[:space:]]+){3}/)    ##Match the first 3 fields together with their trailing whitespace.
  firstPart=substr($0,RSTART,RLENGTH)            ##Save the matched part in firstPart for later use.
  $0=substr($0,RSTART+RLENGTH)                   ##Replace the current line with everything after the matched part.
  match($0,/^[^[:space:]]+/)                     ##Match the leading run of non-whitespace, which is the original 4th field.
  restPart=substr($0,RSTART+RLENGTH)             ##Save everything after the 4th field in restPart.
  print firstPart ($NF in arr?"pseudogene":substr($0,RSTART,RLENGTH)) restPart   ##Print firstPart, then pseudogene (if the last field is in arr) or the original 4th field, then restPart.
}
' file2 file1                                    ##Input file names; file2 must come first.
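For contrast, here is a minimal sketch of the simpler route of assigning to field 4 directly. It reuses the same quoted-key lookup, but because awk rebuilds the record with OFS after a field assignment, runs of whitespace on modified lines collapse to single spaces. The file names symbols.txt and annot.txt, the scratch directory, and the three-space gap in the sample data are assumptions made purely for illustration.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory; note the three spaces
# after "gene" so that any whitespace change is visible in the output.
dir=$(mktemp -d)
printf '%s\n' 'CR18275' > "$dir/symbols.txt"
printf '%s\n' \
  'non-coding X FlyBase gene   476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";' \
  'non-coding X FlyBase gene   20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";' \
  > "$dir/annot.txt"

result=$(awk '
NR==FNR { genes["\"" $1 "\";"]; next }  # keys look like "CR18275";
$NF in genes { $4 = "pseudogene" }      # assignment rebuilds $0 with OFS
{ print }
' "$dir/symbols.txt" "$dir/annot.txt")

printf '%s\n' "$result"
rm -rf "$dir"
```

On the matched line the three spaces collapse to one, while the untouched line keeps its original spacing. If the spacing (or tabs) in file1 must survive, the match()/substr() approach is the safer choice.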
Upvotes: 3
Reputation: 203189
Using GNU awk for the 3rd arg to match() and the \s/\S shorthand:
$ cat tst.awk
NR==FNR {
    genes["\""$1"\";"]
    next
}
$NF in genes {
    match($0,/((\S+\s+){3})\S+(.*)/,a)
    $0 = a[1] "pseudogene" a[3]
}
{ print }
$ awk -f tst.awk file2 file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
Alternatively, using any POSIX awk:
$ cat tst.awk
NR==FNR {
    genes["\""$1"\";"]
    next
}
$NF in genes {
    match($0,/([^[:space:]]+[[:space:]]+){3}/)
    tail = substr($0,RLENGTH+1)
    sub(/[^[:space:]]+/,"",tail)
    $0 = substr($0,1,RLENGTH) "pseudogene" tail
}
{ print }
$ awk -f tst.awk file2 file1
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase pseudogene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase pseudogene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase pseudogene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
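A variation on the same idea, offered only as a sketch: instead of wrapping each symbol from file2 in quotes and a semicolon, strip those characters from a copy of the last field and look up the bare symbol. The regex for the first three fields is spelled out rather than written with {3}, on the assumption that some awk builds do not enable interval expressions by default. The scratch directory and the variable names (sym, head, tail, out) are illustrative choices; the sample data is copied from the question.

```shell
#!/bin/sh
# Recreate the question's sample files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' betaNACtes5 CR18275 28SrRNA-Psi:CR45859 CR32821 > "$dir/file2"
cat > "$dir/file1" <<'EOF'
non-coding X FlyBase gene 20025099 20025170 . + . gene_id "FBgn0052826"; gene_symbol "tRNA:Pro-CGG-1-1";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
non-coding X FlyBase gene 476857 479309 . - . gene_id "FBgn0029523"; gene_symbol "CR18275";
non-coding X FlyBase gene 15576355 15576964 . + . gene_id "FBgn0262163"; gene_symbol "betaNACtes5";
non-coding X FlyBase gene 19910168 19910521 . - . gene_id "FBgn0052821"; gene_symbol "CR32821";
EOF

out=$(awk '
NR==FNR { genes[$1]; next }      # file2: store the bare symbols as keys
{
    sym = $NF                    # work on a copy so $0 is not rebuilt
    gsub(/[";]/, "", sym)        # drop the quotes and the semicolon
}
sym in genes {
    # First three fields plus their trailing blanks, written without {3}.
    match($0, /^[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/)
    head = substr($0, 1, RLENGTH)
    tail = substr($0, RLENGTH + 1)
    sub(/^[^ \t]+/, "", tail)    # remove the original 4th field
    $0 = head "pseudogene" tail  # whitespace around it is kept as-is
}
{ print }
' "$dir/file2" "$dir/file1")

printf '%s\n' "$out"
rm -rf "$dir"
```

With the sample files from the question, the tRNA:Pro-CGG-1-1 line keeps gene and the other four lines, whose symbols all appear in file2, get pseudogene. Normalising the field rather than the key also makes it easy to reuse the genes array for other lookups that expect bare symbols.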
Upvotes: 2