Paillou

Reputation: 839

Removing lines that match specific patterns from another file

I have two files (only the beginning of each is shown):

patterns.txt

m64071_201130_104452/13
m64071_201130_104452/26
m64071_201130_104452/46
m64071_201130_104452/49
m64071_201130_104452/113
m64071_201130_104452/147

myfile.txt

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA

The expected output is:

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA

I want to create a new file containing the lines of myfile.txt that match a line in patterns.txt. Each matching line must be kept together with the ACTG sequence on the line that follows it. I currently use:

for i in $(cat patterns.txt); do
     grep -A 1 "$i" myfile.txt
done > my_newfile.txt

It works, but creating the new file is very slow. The files are fairly large, though not huge (14M for patterns.txt and 700M for myfile.txt).

I also tried grep -v, since I have another file containing the patterns of myfile.txt that are not in patterns.txt, but filling the output file is just as slow.

Is there a faster way to do this?

Upvotes: 3

Views: 634

Answers (2)

James Brown

Reputation: 37404

Another awk:

$ awk -F/ '                            # use / as the field delimiter
NR==FNR {                              # first file: the patterns
    a[$1,$2]                           # hash the first two fields to a
    next
}
{
    if( tf=((substr($1,2),$2) in a) )  # strip the leading >, look up the key, store the result in tf
        print                          # output the header line
    if((getline) > 0 && tf)            # read the next record; if the header was found
        print                          # output the sequence line
}' patterns.txt myfile.txt

Output:

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
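
To make the lookup key concrete, here is a minimal sketch (not part of the original answer) showing how -F/ splits one header line from the question. The variable names line and key are only illustrative, and the two parts are joined with / for readability, whereas the script above joins them with awk's SUBSEP inside the array index.

```shell
#!/bin/sh
# One header line taken from the sample myfile.txt in the question.
line='>m64071_201130_104452/13/ccs'

# With -F/ the fields are: $1=">m64071_201130_104452", $2="13", $3="ccs".
# substr($1, 2) drops the leading ">" so the result has the same shape
# as a line of patterns.txt.
key=$(printf '%s\n' "$line" | awk -F/ '{ print substr($1, 2) "/" $2 }')

printf '%s\n' "$key"
```

The printed value has the same form as the entries in patterns.txt, which is why the hash lookup can decide a match without running a regex per pattern.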

Edit: To output the ones not found:

$ awk -F/ '                              # use / as the field delimiter
NR==FNR {
    a[$1,$2]                             # hash patterns to a
    next
}
{
    if( (substr($1,2),$2) in a ) {       # if the header is listed in patterns
        getline                          # consume its sequence line too
        next                             # and print neither
    }
    print                                # otherwise output
}' patterns.txt myfile.txt

Output:

>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG

Upvotes: 4

RavinderSingh13

Reputation: 133538

With the samples shown, please try the following. Written and tested in GNU awk.

awk '
FNR==NR{
  arr[$0]
  next
}
/^>/{
  found=0
  match($0,/.*\//)
  if((substr($0,RSTART+1,RLENGTH-2)) in arr){
    print
    found=1
  }
  next
}
found
'  patterns.txt myfile.txt

Explanation: a detailed, commented version of the above.

awk '                         ## Start of the awk program.
FNR==NR{                      ## True only while the first file (patterns.txt) is being read.
  arr[$0]                     ## Store the whole line as an index of arr.
  next                        ## Skip the remaining rules for this line.
}
/^>/{                         ## For lines starting with > (header lines):
  found=0                     ## Reset found.
  match($0,/.*\//)            ## Match everything up to and including the last / in the line.
  if((substr($0,RSTART+1,RLENGTH-2)) in arr){  ## If the match, minus the leading > and trailing /, is in arr:
    print                     ## Print the header line.
    found=1                   ## Set found to 1.
  }
  next                        ## Skip the remaining rules for this line.
}
found                         ## Print the line (the sequence) if found is set.
'  patterns.txt myfile.txt    ## Input file names.
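
The same flag idea can be inverted to keep the records that are not listed in patterns.txt. The following is a hedged sketch rather than the answer's own code: it rebuilds a shortened copy of the question's sample in a scratch directory, and it assumes every header has the form >ID/ccs, so the ID is obtained by dropping the leading > and the last /-separated part. The names tmp, rest and id are only illustrative.

```shell
#!/bin/sh
# Build shortened sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' m64071_201130_104452/13 m64071_201130_104452/26 > "$tmp/patterns.txt"
printf '%s\n' \
  '>m64071_201130_104452/13/ccs' ACAGTCGAGCG \
  '>m64071_201130_104452/16/ccs' ACAGTCGAGCG \
  '>m64071_201130_104452/26/ccs' TAGACAATGTA > "$tmp/myfile.txt"

# Flag approach, inverted: print records whose ID is NOT in patterns.txt.
rest=$(awk '
FNR==NR { arr[$0]; next }        # first file: remember every ID
/^>/ {
  id = substr($0, 2)             # drop the leading ">"
  sub("/[^/]*$", "", id)         # drop the last "/..." part (assumed to be "/ccs")
  found = (id in arr)            # 1 if this record is listed, else 0
}
!found                           # print header and sequence lines of unlisted records
' "$tmp/patterns.txt" "$tmp/myfile.txt")

printf '%s\n' "$rest"
rm -rf "$tmp"
```

Because the flag is only updated on header lines, each sequence line inherits the decision made for its header, so no getline is needed.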

Upvotes: 8
