Reputation: 24373
I have a file called `words.txt` containing a list of words. I also have a file called `file.txt` containing a sentence per line. I need to quickly delete any lines in `file.txt` that contain one of the lines from `words.txt`, but only if the match is found somewhere between `{` and `}`.
E.g. `file.txt`:

```
Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.
```

E.g. `words.txt`:

```
cat
mice
```
Example output:

```
Once upon a time there was a cat.
```

The other two lines are removed because "cat" is found on them, and in both cases it appears between `{` and `}`.
The following script successfully does this task:

```
while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt
```
This script is very slow. Sometimes `words.txt` contains several thousand items, so the while loop takes several minutes. I attempted to use the `sed -f` option, which seems to allow reading a file, but I cannot find any manual explaining how to use it. How can I improve the speed of the script?
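For reference, `-f` makes `sed` read editing commands (not patterns) from a file, so the per-word loop can be replaced by generating one script of delete commands up front and running `sed` in a single pass. A sketch, assuming the words contain no regex metacharacters or `/` characters:

```shell
# Turn each word into one delete command of the form /{.*word.*}/d
sed 's|.*|/{.*&.*}/d|' words.txt > script.sed
# Apply every delete command in a single pass over file.txt
sed -f script.sed file.txt
```

With GNU sed, `-i` can be added to the second command to edit `file.txt` in place.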
Upvotes: 4
Views: 132
Reputation: 75488
An awk solution:

```
awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt
```
It modifies `file.txt` in place, leaving the expected output:

```
Once upon a time there was a cat.
```
Uncondensed version:

```
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
    b[j++] = $0
}
END {
    printf "" > FILENAME
    for (i = 0; i in b; ++i)
        print b[i] > FILENAME
}
' words.txt file.txt
```
If the files are expected to get so large that awk may not be able to hold every surviving line in memory, we can only send the result to stdout; we cannot modify the file directly:
```
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
1
' words.txt file.txt
```
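To update `file.txt` with this stdout variant, the usual pattern is to redirect to a temporary file and move it back into place once awk finishes (a sketch; `file.txt.tmp` is a throwaway name):

```shell
# Filter to a temporary file, then atomically replace the original
awk '
NR == FNR {
    a["{[^{}]*" $0 "[^{}]*}"]++
    next
}
{
    for (i in a)
        if ($0 ~ i)
            next
}
1
' words.txt file.txt > file.txt.tmp && mv file.txt.tmp file.txt
```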
Upvotes: 4
Reputation: 74615
You could do this in two steps:

1. Wrap each word in `words.txt` with `{.*` and `.*}`:

   ```
   awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
   ```

2. Use `grep` with inverse match:

   ```
   grep -v -f wrapped.txt file.txt
   ```
This would be particularly useful if `words.txt` is very large, as a pure-awk approach (storing all the entries of `words.txt` in an array) would require a lot of memory.
If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:

```
awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt
```
The `-` is a placeholder which tells `grep` to read the pattern file from stdin.
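In bash, process substitution achieves the same thing without the explicit `-` placeholder; `<(...)` presents the awk output to `grep` as a file (a sketch):

```shell
# Equivalent pipeline using bash process substitution
grep -v -f <(awk '{ print "{.*" $0 ".*}" }' words.txt) file.txt
```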
If the size of `words.txt` isn't too big, you could do the whole thing in `awk`:

```
awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt
```
expanded:

```
awk 'NR==FNR { a[$0]++; next }
{
    p=1
    for (i in a) {
        if ($0 ~ "{.*" i ".*}") { p=0; break }
    }
}p' words.txt file.txt
```
The first block builds an array containing each line in `words.txt`. The second block runs for every line in `file.txt`. A flag `p` controls whether the line is printed. If the line matches the pattern, `p` is set to false. When the `p` outside the last block evaluates to true, the default action occurs, which is to print the line.
Upvotes: 1
Reputation: 295403
In pure native bash (4.x):

```
#!/usr/bin/env bash
# ^-- MUST be a bash shebang, NOT /bin/sh

readarray -t words <words.txt          # read words into array
IFS='|'                                # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]"     # form a regex matching all words
while read -r; do                      # for each line in file...
  if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
    printf '%s\n' "$REPLY"             # ...and print it if not.
  fi
done <file.txt
```
Native bash is somewhat slower than awk, but this is still a single-pass solution (`O(n+m)`, whereas the `sed -i` approach was `O(n*m)`), making it vastly faster than any iterative approach.
Upvotes: 1
Reputation: 7981
I think this should work for you:

```
sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt
```

This basically just modifies the `words.txt` entries on the fly and uses them as a pattern file for `grep`.
Upvotes: 2
Reputation: 9509
I think that using the `grep` command should be way faster. For example:

```
grep -f words.txt -v file.txt
```

- the `-f` option makes `grep` use the `words.txt` file as matching patterns
- the `-v` option reverses the matching, i.e. it keeps the lines that do not match any of the patterns

It doesn't solve the `{}` constraint, but that is easily avoidable, for example by adding the brackets to the pattern file (or in a temporary file created at runtime).
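One way to add the brackets in a temporary pattern file at runtime (a sketch; `patterns.tmp` is a throwaway name):

```shell
# Wrap each word as {.*word.*} in a temporary pattern file
sed 's/.*/{.*&.*}/' words.txt > patterns.tmp
grep -v -f patterns.tmp file.txt
rm patterns.tmp
```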
Upvotes: 2
Reputation: 924
You can use `grep` to match the two files like this:

```
grep -vf words.txt file.txt
```
Upvotes: 2