Manuel

Reputation: 205

Use a modified version of each line of a text file as an argument in a sed command (bash)

I need to extract all text between two strings from file1. The first string is each line of file2 and the second string is always "Lambda". However, I don't know how to state each string of file2 in my sed command. Also, I need to remove a ">" at the beginning of each line of file2 in order to match the contents of file1:

example file1:

some_text1

random text

Lambda

some_text2

random text

Lambda

some_text3

random text

Lambda

example file2:

>some_text1
>some_text3

I've come up with this incomplete command for one line:

sed -n '/**line from file2, without ">" at the beginning**/,/^Lambda/p' file1

And, although incomplete, this would be my idea for a loop (this does not include removing the >, which I also need in the command):

for line in file1; do sed -n '/$line/,/^Lambda/p' file1; done

Example output (note that some_text2 is not present since it isn't in file2):

some_text1

random text

Lambda
some_text3

random text

Lambda

What can I do?

Upvotes: 0

Views: 197

Answers (4)

RARE Kpop Manifesto

Reputation: 2865

Try:

{mawk/mawk2/gawk} 'BEGIN { FS = "^[>]"; FN = ARGV[--ARGC];     
        
        while (getline < FN) { lookupL[$2]++ }; 
        close(FN);

        FN = ARGV[ARGC] = ""; 
        FS = "^Lambda";
    } { 
        match($0, /[[:graph:]]+/); 

       if (substr($0, RSTART, RLENGTH) in lookupL) { 

           do { print; 
                if (NF>1) {break} 

           } while (getline); 
       }
   }' file1 file2

The size of file2 shouldn't be much of an issue unless it is larger than roughly 2 GB.

Upvotes: 0

M. Nejat Aydin

Reputation: 10133

Using sed in a loop is mostly bad practice. Consider the version below instead, which first generates the sed commands (using sed itself!) and then calls sed once to run them against file1:

 sed -n -f <(sed           \
     -e 's/.//'            \
     -e 's/[]\/*.[]/\\&/g' \
     -e 's%.*%/^&$/,/^Lambda$/p%' file2) file1

You may omit the -e 's/[]\/*.[]/\\&/g' portion if file2 is guaranteed not to contain any of the characters []\/*. in its lines. Note that the <(...) expression is process substitution in bash; to the outer sed it appears as a file containing the output of the command between the parentheses.
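To see what the outer sed actually receives, it can help to run the inner sed on its own. The following is only a sketch: the temporary directory and the variable name generated are made up for illustration, and the sample file2 mirrors the one in the question.

```shell
# Hypothetical sample input mirroring the question's file2
tmpdir=$(mktemp -d)
printf '%s\n' '>some_text1' '>some_text3' > "$tmpdir/file2"

# The same three expressions as in the answer, run without the outer sed:
#   1. drop the first character (the ">")
#   2. backslash-escape characters that are special in a sed address
#   3. wrap each line into a "/^line$/,/^Lambda$/p" range command
generated=$(sed               \
    -e 's/.//'                \
    -e 's/[]\/*.[]/\\&/g'     \
    -e 's%.*%/^&$/,/^Lambda$/p%' "$tmpdir/file2")

printf '%s\n' "$generated"
rm -rf "$tmpdir"
```

For this sample input, the script printed is one range command per line of file2, i.e. /^some_text1$/,/^Lambda$/p followed by /^some_text3$/,/^Lambda$/p, which is what sed -n -f then applies to file1.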

Upvotes: 0

Gordon Davisson

Reputation: 125928

You can do this much more efficiently with sed by creating a single pattern that matches all of the strings in file2, and then running it just once on file1. With your example, the pattern would be something like (some_text1|some_text3) (although this is in "extended" regex syntax, so you need to use sed -E with it). Something like this:

lines=$(sed -n 's/^>//p' file2)    # This just reads in the lines with > removed
pattern="(${lines//$'\n'/|})"      # This actually converts them to a regex pattern
sed -En "/${pattern}/,/^Lambda/ p" file1    # Extract all matching ranges

Note that if you want to require the string from file2 to match the entire line, not just somewhere in the line, you'd use:

pattern="^(${lines//$'\n'/|})\$"    # The ^ and $ anchor to the beginning & end of line

Also, be aware that if the lines from file2 contain any regex metacharacters, they'll be treated with their regex meanings; if you want them to be treated as strictly literal strings, you'll need to preprocess them to escape the regex metacharacters. If they contain /, that will also need to be escaped, since / is the address delimiter in the sed command.
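One possible way to do that preprocessing is an extra sed pass that puts a backslash in front of every extended-regex metacharacter and every /. This is a sketch under assumptions, not a tested general solution: the sample files, the temporary directory and the variable result are invented for illustration, and the ${lines//...} expansion requires bash.

```shell
# Hypothetical sample data; "a.b/c" contains a regex metacharacter and a slash
tmpdir=$(mktemp -d)
printf '%s\n' '>some_text1' '>a.b/c' > "$tmpdir/file2"
printf '%s\n' 'some_text1' 'random text' 'Lambda' \
              'aXb/c' 'not wanted' 'Lambda' \
              'a.b/c' 'wanted' 'Lambda' > "$tmpdir/file1"

# Strip the leading ">", then escape ERE metacharacters and "/" with a backslash
lines=$(sed -n 's/^>//p' "$tmpdir/file2" | sed 's#[][\\/.^$*+?(){}|]#\\&#g')

# Build an anchored alternation, e.g. ^(some_text1|a\.b\/c)$
pattern="^(${lines//$'\n'/|})\$"

# Extract the matching ranges in a single pass over file1
result=$(sed -En "/${pattern}/,/^Lambda/p" "$tmpdir/file1")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With the escaping in place, the line aXb/c no longer starts a range, because the dot in a.b/c is matched literally.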

Upvotes: 1

Charles Duffy

Reputation: 295639

Running multiple copies of sed for this is quite inefficient. Below is an awk script that only needs to read file1 a single time, no matter how long file2 is:

#!/usr/bin/env bash
awk '
  BEGIN   { in_block=0 }
  NR==FNR { array[substr($0, 2)]=1; next }
  in_block == 0 {
    for (item in array) {
      if ($0 ~ item) {
        in_block=1
        print($0)
        next
      }
    }
  }
  in_block == 1 { print }
  in_block == 1 && /^Lambda/ { in_block=0 }
' file2 file1
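The script above treats each entry from file2 as a regex and tests it against every line. If the entries are meant to match whole lines literally, a variant can use awk's "in" operator for a direct array lookup instead of the inner loop. This is a sketch under that assumption only; the names wanted and result and the sample files are hypothetical, and a blank line in file2 would produce an empty key that matches blank lines in file1.

```shell
# Hypothetical sample data shaped like the question's files
tmpdir=$(mktemp -d)
printf '%s\n' '>some_text1' '>some_text3' > "$tmpdir/file2"
printf '%s\n' 'some_text1' 'random text' 'Lambda' \
              'some_text2' 'random text' 'Lambda' \
              'some_text3' 'random text' 'Lambda' > "$tmpdir/file1"

result=$(awk '
  # First file (file2): drop the leading ">" and remember the rest as a key
  NR==FNR { wanted[substr($0, 2)] = 1; next }
  # Second file (file1): a block starts on an exact, literal whole-line match
  !in_block && ($0 in wanted) { in_block = 1 }
  in_block { print }
  # The block ends after the first line starting with "Lambda"
  in_block && /^Lambda/ { in_block = 0 }
' "$tmpdir/file2" "$tmpdir/file1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the lookup is a literal string comparison, regex metacharacters in file2 need no escaping here, at the cost of no longer matching a string that appears only as part of a longer line.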

Upvotes: 2
