Village

Reputation: 24373

How to quickly delete the lines in a file that contain items from a list in another file in BASH?

I have a file called words.txt containing a list of words. I also have a file called file.txt containing a sentence per line. I need to quickly delete any lines in file.txt that contain one of the lines from words.txt, but only if the match is found somewhere between { and }.

E.g. file.txt:

Once upon a time there was a cat.
{The cat} lived in the forest.
The {cat really liked to} eat mice.

E.g. words.txt:

cat
mice

Example output:

Once upon a time there was a cat.

The other two lines are removed because "cat" appears on them between { and }.

The following script successfully does this task:

while read -r line
do
    sed -i "/{.*$line.*}/d" file.txt
done < words.txt

This script is very slow. Sometimes words.txt contains several thousand items, so the while loop takes several minutes. I attempted to use the sed -f option, which seems to read editing commands from a file, but I cannot find any documentation explaining how to use it.
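
From what I can tell, sed -f reads its commands from a file, so something like the following might work (delete.sed is just an arbitrary scratch-file name): generate one /.../d command per word, then run sed a single time instead of once per word.

sed 's|.*|/{.*&.*}/d|' words.txt > delete.sed   # one delete command per word
sed -i -f delete.sed file.txt                   # apply all of them in one pass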

How can I improve the speed of the script?

Upvotes: 4

Views: 132

Answers (6)

konsolebox

Reputation: 75488

An awk solution:

awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next;b[j++]=$0}END{printf "">FILENAME;for(i=0;i in b;++i)print b[i]>FILENAME}' words.txt file.txt

It modifies file.txt in place so that it contains the expected output:

Once upon a time there was a cat.

Uncondensed version:

awk '
    NR == FNR {
        a["{[^{}]*" $0 "[^{}]*}"]++
        next
    }
    {
        for (i in a)
            if ($0 ~ i)
                next
        b[j++] = $0
    }
    END {
        printf "" > FILENAME
        for (i = 0; i in b; ++i)
            print b[i] > FILENAME
    }
' words.txt file.txt

If the files are expected to get so large that awk may not be able to hold everything in memory, we can only redirect the result to stdout; we cannot modify the file in place:

awk '
    NR == FNR {
        a["{[^{}]*" $0 "[^{}]*}"]++
        next
    }
    {
        for (i in a)
            if ($0 ~ i)
                next
    }
    1
' words.txt file.txt
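
If the filtered result still has to replace file.txt, a common sketch is to redirect stdout to a scratch file and move it over the original (tmp is an arbitrary name here):

awk 'NR==FNR{a["{[^{}]*"$0"[^{}]*}"]++;next}{for(i in a)if($0~i)next}1' words.txt file.txt > tmp && mv tmp file.txt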

Upvotes: 4

Tom Fenech

Reputation: 74615

You could do this in two steps:

  1. Wrap each word in words.txt with {.* and .*}:

     awk '{ print "{.*" $0 ".*}" }' words.txt > wrapped.txt
    
  2. Use grep with inverse match:

     grep -v -f wrapped.txt file.txt
    

This would be particularly useful if words.txt is very large, as a pure-awk approach (storing all the entries of words.txt in an array) would require a lot of memory.

If you would prefer a one-liner and would like to skip creating the intermediate file, you could do this:

awk '{ print "{.*" $0 ".*}" }' words.txt | grep -v -f - file.txt

The - is a placeholder which tells grep to read the pattern file from stdin.


Update

If the size of words.txt isn't too big, you could do the whole thing in awk:

awk 'NR==FNR{a[$0]++;next}{p=1;for(i in a){if ($0 ~ "{.*" i ".*}") { p=0; break}}}p' words.txt file.txt

expanded:

awk 'NR==FNR { a[$0]++; next }
     { 
         p=1
         for (i in a) {
             if ($0 ~ "{.*" i ".*}") { p=0; break }
         }
     }p' words.txt file.txt

The first block builds an array containing each line of words.txt. The second block runs for every line of file.txt: the flag p controls whether the line is printed, and it is set to 0 as soon as the line matches one of the patterns. When the bare p after the last block evaluates to true, awk's default action is triggered, which is to print the line.

Upvotes: 1

Charles Duffy

Reputation: 295403

In pure native bash (4.x):

#!/usr/bin/env bash
# ^-- MUST be run with bash 4.x, NOT /bin/sh (readarray needs bash 4)

readarray -t words <words.txt          # read words into array
IFS='|'                                # use | as delimiter when expanding $*
words_re="[{].*(${words[*]}).*[}]"     # form a regex matching all words
while read -r; do                      # for each line in file...
  if ! [[ $REPLY =~ $words_re ]]; then # ...check whether it matches...
    printf '%s\n' "$REPLY"             # ...and print it if not.
  fi
done <file.txt

Native bash is somewhat slower than awk, but this is still a single-pass solution (O(n+m), whereas the sed -i approach was O(n*m)), making it vastly faster than any iterative approach.

Upvotes: 1

pgl

Reputation: 7981

I think this should work for you:

sed -e 's/.*/{.*&.*}/' words.txt | grep -vf- file.txt > out ; mv out file.txt

This rewrites the patterns from words.txt on the fly and feeds the result to grep as its pattern file.

Upvotes: 2

Elwinar

Reputation: 9509

I think that using the grep command should be much faster. For example:

grep -f words.txt -v file.txt
  • The -f option makes grep use the words.txt file as a list of patterns to match
  • The -v option inverts the matching, i.e. it keeps the lines that do not match any of the patterns.

It doesn't handle the {} constraint by itself, but that is easily addressed, for example by adding the brackets to each pattern in the pattern file (or in a temporary file created at runtime), as sketched below.
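
A minimal sketch of that idea, with patterns.txt as an arbitrary temporary file name:

sed 's/.*/{.*&.*}/' words.txt > patterns.txt   # wrap each word in {.* and .*}
grep -v -f patterns.txt file.txt               # keep lines matching no pattern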

Upvotes: 2

Subbeh

Reputation: 924

You can use grep to match the two files like this:

grep -vf words.txt file.txt

Upvotes: 2
