Reputation: 839
I've got two files (I only show the beginning of these files) :
patterns.txt
m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147
myfile.txt
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I should get an output like that :
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
I want to create a new file if the lines in patterns.txt match with the lines in myfile.txt . I need to keep the letters ACTG associated with the pattern in question. I use :
for i in $(cat patterns.txt); do
grep -A 1 $i myfile.txt; done > my_newfile.txt
It works, but it's very slow to create the new file... The files I work on are pretty large but not too much (14M for patterns.txt and 700M for myfile.txt).
I also tried to use grep -v
because I have the another file which contains the others patterns of myfile.txt not present in patterns.txt. But it is the same "speed filling file" problem.
If you see a solution..
Upvotes: 3
Views: 634
Reputation: 37404
Another awk:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) # if first part found in hash
print # output and store found result in var tf
if(getline && tf) # read next record and if previous record was found
print # output
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ ' # / delimiter
NR==FNR {
a[$1,$2] # hash patterns to a
next
}
{
if( tf=((substr($1,2),$2) in a) ) { # if first part found in hash
getline # consume the next record too
next
}
print # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
Upvotes: 4
Reputation: 133538
With your shown samples please try following. Written and tested in GNU awk
.
awk '
FNR==NR{
arr[$0]
next
}
/^>/{
found=0
match($0,/.*\//)
if((substr($0,RSTART+1,RLENGTH-2)) in arr){
print
found=1
}
next
}
found
' patterns.txt myfile.txt
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition which will be TRUE when patterns.txt is being read.
arr[$0] ##Creating array with index of current line.
next ##next will skip all further statements from here.
}
/^>/{ ##Checking condition if line starts from > then do following.
found=0 ##Unsetting found here.
match($0,/.*\//) ##using match to match a regex to till / in current line.
if((substr($0,RSTART+1,RLENGTH-2)) in arr){ ##Checking condition if sub string of matched regex is present in arr then do following.
print ##Printing current line here.
found=1 ##Setting found to 1 here.
}
next ##next will skip all further statements from here.
}
found ##Printing the line if found is set.
' patterns.txt myfile.txt ##Mentioning Input_file names here.
Upvotes: 8