user1899415

Reputation: 3125

Efficiently remove lines from fileA that contain a string from fileB

FileA contains lines; FileB contains words.

How can I efficiently remove the lines from FileA that contain words found in FileB?

I tried the following, but I'm not even sure whether they work because they take so long to run.

Tried grep:

grep -v -f <(awk '{print $1}' FileB.txt) FileA.txt > out

Also tried python:

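import sys

# sys.argv[1]: file of words (FileB); sys.argv[2]: output file
with open(sys.argv[1]) as f:
    bad_words = f.read().splitlines()

with open('FileA') as master_lines, open(sys.argv[2], 'w') as out:
    for line in master_lines:
        # Substring test against every word, for every line
        if not any(bad_word in line for bad_word in bad_words):
            out.write(line)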

FileA:

abadan refinery is one of the largest in the world.
a bad apple spoils the barrel.
abaiara is a city in the south region of brazil.
a ban has been imposed on the use of faxes

FileB:

abadan
abaiara

DESIRED OUTPUT:

a bad apple spoils the barrel.
a ban has been imposed on the use of faxes

Upvotes: 2

Views: 139

Answers (3)

mdadm

Reputation: 1363

I refuse to believe that Python can't at least match the performance of Perl on this one. This is my quick attempt at a more efficient Python solution. I'm using sets to optimize the search part of the problem: isdisjoint returns True when two sets have no elements in common, so a line is kept only if none of its words is in the set of bad words.

This solution takes 12 seconds to run on my machine for a fileA with 3M lines and a fileB with 200k words; the Perl version takes 9. The biggest slowdown seems to be re.split, although it still seems to be faster than str.split in this case.

Please comment on this answer if you have any suggestions to improve the speed.

import re

splitter = re.compile(r"\s")

# Load the words into a set for fast membership tests
with open('Downloads/fileB.txt') as fileb:
    bad_words = set(line.strip() for line in fileb)

with open('Downloads/fileA.txt') as filea, open('output.txt', 'w') as output:
    for line in filea:
        line_words = set(splitter.split(line))
        # Keep the line only if it shares no word with bad_words
        if bad_words.isdisjoint(line_words):
            output.write(line)
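
For comparison, here is a minimal sketch of the same set-based idea wrapped in a function, using str.split() with no arguments instead of a compiled regex. The function name filter_lines and its parameters are hypothetical, chosen only for illustration, and I have not benchmarked this variant against the version above. Note that both approaches compare whole whitespace-separated tokens, so a word followed directly by punctuation (for example "world.") will not match the bare word "world".

```python
def filter_lines(lines_path, words_path, out_path):
    """Copy lines from lines_path to out_path, skipping every line that
    contains a whitespace-separated token listed in words_path.

    All three parameter names are illustrative assumptions.
    """
    with open(words_path) as words_file:
        # One word per line; blank lines are ignored.
        bad_words = {w.strip() for w in words_file if w.strip()}

    with open(lines_path) as src, open(out_path, "w") as dst:
        for line in src:
            # split() with no arguments splits on runs of whitespace
            # and never yields empty strings.
            if bad_words.isdisjoint(line.split()):
                dst.write(line)
```

It could be called as filter_lines('fileA.txt', 'fileB.txt', 'out.txt'). isdisjoint accepts any iterable, so the list returned by split() does not need to be converted to a set first.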

Upvotes: 2

BMW

Reputation: 45223

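Using grep with fixed-string (-F), whole-word (-w) matching, reading the patterns from fileB (-f) and printing only the non-matching lines (-v):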

grep -v -Fwf fileB fileA

Upvotes: 1

jaypal singh

Reputation: 77085

The commands you have look good, so maybe it's time to try a good scripting language. Try running the following Perl script and see whether it finishes any faster.

#!/usr/bin/perl

use strict;
use warnings;

# fileB holds the words to look up; fileA holds the lines to filter.
open my $LOOKUP, "<", "fileB" or die "Cannot open lookup file: $!";
open my $MASTER, "<", "fileA" or die "Cannot open Master file: $!";
open my $OUTPUT, ">", "out" or die "Cannot create Output file: $!";

my %words;

# Load every word into a hash for constant-time lookups.
while (my $word = <$LOOKUP>) {
    chomp($word);
    ++$words{$word};
}

LINE: while (my $line = <$MASTER>) {
    # Skip the line as soon as one of its words is found in the hash.
    for my $token (split ' ', $line) {
        next LINE if exists $words{$token};
    }
    print $OUTPUT $line;
}

close $OUTPUT or die "Cannot close Output file: $!";

Upvotes: 1
