Reputation: 205
I need to extract all text between two strings from file1. The first string is each line of file2 and the second string is always "Lambda". However, I don't know how to state each string of file2 in my sed command. Also, I need to remove a ">" at the beginning of each line of file2 in order to match the contents of file1:
example file1:
some_text1
random text
Lambda
some_text2
random text
Lambda
some_text3
random text
Lambda
example file2:
>some_text1
>some_text3
I've come up with this incomplete command for one line:
sed -n '/**line from file2, without ">" at the beginning**/,/^Lambda/p' file1
And, although incomplete, this would be my idea for a loop (this does not include removing the >, which I also need in the command):
for line in file1; do sed -n '/$line/,/^Lambda/p' file1; done
Example output (note that some_text2 is not present since it isn't in file2):
some_text1
random text
Lambda
some_text3
random text
Lambda
What can I do?
Upvotes: 0
Views: 197
Reputation: 2865
Try this, using whichever of mawk, mawk2, or gawk you have:
{mawk/mawk2/gawk} 'BEGIN { FS = "^[>]"; FN = ARGV[--ARGC];
while ((getline < FN) > 0) { lookupL[$2]++ };
close(FN);
FN = ARGV[ARGC] = "";
FS = "^Lambda";
} {
match($0, /[[:graph:]]+/);
if (substr($0, RSTART, RLENGTH) in lookupL) {
do { print;
if (NF>1) {break}
} while (getline);
}
}' file1 file2
The size of file2 shouldn't be much of an issue unless it is larger than roughly 2 GB.
Upvotes: 0
Reputation: 10133
Using sed in a loop is mostly bad practice. You may consider using the version below, which first creates sed commands (using sed itself!), then calls sed to process those commands:
sed -n -f <(sed \
-e 's/.//' \
-e 's/[]\/*.[]/\\&/g' \
-e 's%.*%/^&$/,/^Lambda$/p%' file2) file1
You may want to omit the -e 's/[]\/*.[]/\\&/g' portion if it is guaranteed that file2 doesn't contain any of the []\/*. characters. Note that the <(...) expression is process substitution in bash; it appears as a file containing the output of the command between the parentheses.
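For shells without process substitution, the same idea can be sketched with an intermediate file, which also makes it easy to inspect the generated commands. This is only an illustration: the scratch directory and the script.sed file name are assumptions, and the inputs are the sample files from the question.

```shell
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' some_text1 'random text' Lambda \
              some_text2 'random text' Lambda \
              some_text3 'random text' Lambda > "$workdir/file1"
printf '%s\n' '>some_text1' '>some_text3' > "$workdir/file2"

# Step 1: turn each line of file2 into a sed range command.
# (script.sed is a hypothetical file name chosen for this sketch.)
sed -e 's/.//' \
    -e 's/[]\/*.[]/\\&/g' \
    -e 's%.*%/^&$/,/^Lambda$/p%' "$workdir/file2" > "$workdir/script.sed"

# Show the generated commands.
generated=$(cat "$workdir/script.sed")
printf '%s\n' "$generated"

# Step 2: run the generated commands against file1.
result=$(sed -n -f "$workdir/script.sed" "$workdir/file1")
printf '%s\n' "$result"

rm -r "$workdir"
```

For the sample file2, the generated script should consist of the two lines /^some_text1$/,/^Lambda$/p and /^some_text3$/,/^Lambda$/p, so file1 is still read only once.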
Upvotes: 0
Reputation: 125928
You can do this much more efficiently with sed by creating a single pattern that matches all of the strings in file2, and then running it just once on file1. With your example, the pattern would be something like (some_text1|some_text3) (although this is in "extended" regex syntax, so you need to use sed -E with it). Something like this:
lines=$(sed -n 's/^>//p' file2) # This just reads in the lines with > removed
pattern="(${lines//$'\n'/|})" # This actually converts them to a regex pattern
sed -En "/${pattern}/,/^Lambda/ p" file1 # Extract all matching ranges
Note that if you want to require the string from file2 to match the entire line, not just somewhere in the line, you'd use:
pattern="^(${lines//$'\n'/|})\$" # The ^ and $ anchor to the beginning & end of line
Also, be aware that if the lines from file2 contain any regex metacharacters, they'll be treated as their regex meanings; if you want them to be treated as strictly literal strings, you'll need to preprocess them to escape the regex metacharacters. If they contain /, that'll also need to be escaped, since it is the delimiter of the sed address.
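One way that preprocessing might look is sketched below. It assumes the extended-regex metacharacters to escape are ] [ \ . * ^ $ + ? ( ) { } | plus the / delimiter, and it joins the lines with paste instead of the bash-only ${lines//...} expansion. The sample data is made up for the demonstration: file2 contains a.b, which would also match the line axb if the dot were left unescaped.

```shell
# Hypothetical sample data: "a.b" should match only the literal line "a.b".
workdir=$(mktemp -d)
printf '%s\n' axb skip Lambda a.b keep Lambda > "$workdir/file1"
printf '%s\n' '>a.b' > "$workdir/file2"

# Strip the leading ">", put a backslash in front of every regex
# metacharacter and "/", then join the lines with "|".
pattern=$(sed -n 's/^>//p' "$workdir/file2" |
  sed 's%[][\.*^$+?(){}|/]%\\&%g' |
  paste -sd '|' -)

# Anchor the alternation so each string must match the entire line.
result=$(sed -En "/^(${pattern})\$/,/^Lambda/p" "$workdir/file1")
printf '%s\n' "$result"

rm -r "$workdir"
```

With this input only the block that starts at the literal a.b line is printed; the axb block is skipped.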
Upvotes: 1
Reputation: 295639
Running multiple copies of sed for this is quite inefficient. The below is an awk script that only needs to read file1 a single time, no matter how long file2 is:
#!/usr/bin/env bash
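awk '
  BEGIN { in_block=0 }

  # First file (file2): store each line without its leading ">".
  NR==FNR { array[substr($0, 2)]=1; next }

  # Second file (file1): outside a block, look for a line matching any stored string.
  in_block == 0 {
    for (item in array) {
      if ($0 ~ item) {
        in_block=1
        print($0)
        next
      }
    }
  }

  # Inside a block: print every line, and stop after the Lambda line.
  in_block == 1 { print }
  in_block == 1 && /^Lambda/ { in_block=0 }
' file2 file1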
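Note that $0 ~ item treats each string from file2 as a regular expression and matches it anywhere in the line. If the strings are instead meant to equal the whole line literally, a variant along the following lines may be enough; this is a sketch under that assumption rather than a drop-in replacement, and the array name wanted and the scratch files are illustrative.

```shell
# Recreate the sample inputs from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' some_text1 'random text' Lambda \
              some_text2 'random text' Lambda \
              some_text3 'random text' Lambda > "$workdir/file1"
printf '%s\n' '>some_text1' '>some_text3' > "$workdir/file2"

result=$(awk '
  # First file (file2): remember each line without its leading ">".
  NR == FNR { wanted[substr($0, 2)]; next }

  # Second file (file1): a block starts on a line that equals a stored string.
  !in_block && ($0 in wanted) { in_block = 1 }

  # Print while inside a block; the Lambda line closes it.
  in_block { print }
  in_block && /^Lambda/ { in_block = 0 }
' "$workdir/file2" "$workdir/file1")
printf '%s\n' "$result"

rm -r "$workdir"
```

The array lookup replaces the inner loop, so each line of file1 costs one hash lookup regardless of how many lines file2 has.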
Upvotes: 2