GB44444

Reputation: 57

Non-grep method to remove lines from a file where a string appears in another file

I know that there are a few similar questions to this that have previously been answered, but I haven't managed to find exactly what I want (and have tried variants of proposed solutions). Hopefully this is an easy question.

I have a tab-separated file (file.txt) with 10 columns and about half a million lines, which in simplified form looks like this:

ID     Col1      Col2     Col3
a        4        2        8
b        5        6        1
c        8        4        1
d        3        5        9
e        8        5        2

I'd like to remove all the lines where, say, "b" or "d" appears in the first (ID) column. The output that I want is:

ID     Col1      Col2     Col3
a        4        2        8
c        8        4        1
e        8        5        2

It is important that the order of the IDs is maintained in my output file.

In reality, there are about 100,000 lines that I want to remove. I therefore have a reference file (referencefile.txt) that lists all the IDs that I want removed from file.txt. In this example, the reference file would simply contain "b" and "d" on successive lines.

I am using grep at the moment, and while it works, it is proving painfully slow.

grep -v -f referencefile.txt file.txt

Is there a way of using awk or sed (or anything else for that matter) to speed up the process?

Many thanks.

AB

Upvotes: 0

Views: 68

Answers (2)

stevesliva

Reputation: 5665

There are ways of speeding up grep itself.

I'd suggest:

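  • -F: treat the patterns read with -f referencefile.txt as fixed strings rather than regexes.

  • -w: match whole words only, so the ID "b" does not also match inside a longer word such as "ab".

  • Possibly LC_ALL=C: set the LC_ALL environment variable so that grep works in the C (ASCII) locale rather than UTF-8.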
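
Putting these three suggestions together, the command might look like the sketch below. The sample data is a hypothetical tab-separated copy of the example in the question, written to a temporary directory; only the final grep line is the actual suggestion.

```shell
# Work in a scratch directory so nothing existing is overwritten.
cd "$(mktemp -d)"

# Hypothetical sample data mirroring the question's example (tab-separated).
printf 'ID\tCol1\tCol2\tCol3\na\t4\t2\t8\nb\t5\t6\t1\nc\t8\t4\t1\nd\t3\t5\t9\ne\t8\t5\t2\n' > file.txt
printf 'b\nd\n' > referencefile.txt

# -v: print non-matching lines, -w: whole words only,
# -F: fixed strings (no regex), -f: read the patterns from a file.
# LC_ALL=C makes grep compare bytes instead of decoding UTF-8.
LC_ALL=C grep -v -w -F -f referencefile.txt file.txt > filtered_grep.txt
cat filtered_grep.txt
```

Note that -w matches a whole word anywhere on the line, not only in the ID column, so a line would also be dropped if one of the listed IDs happened to appear as a value in another column.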

Upvotes: 0

Akshay Hegde

Reputation: 16997

Using awk:

awk 'FNR>1 && ($1 == "b" || $1 == "d"){ next } 1' infile

# OR

awk 'FNR>1 && $1 ~ /^([bd])$/{ next } 1' infile

# To exclude lines from infile whose first field
# is one of the ids listed in id_lists
awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' id_lists infile

# To include only lines from infile whose first field
# is one of the ids listed in id_lists
awk 'FNR==NR{ids[$1];next}FNR==1 || ($1 in ids)' id_lists infile

Test Results:

Input

$ cat infile 
ID     Col1      Col2     Col3
a        4        2        8
b        5        6        1
c        8        4        1
d        3        5        9
e        8        5        2

Output

$ awk 'FNR>1 && $1 ~ /^([bd])$/{ next } 1' infile
ID     Col1      Col2     Col3
a        4        2        8
c        8        4        1
e        8        5        2

$ awk 'FNR>1 && ($1 == "b" || $1 == "d"){ next } 1' infile
ID     Col1      Col2     Col3
a        4        2        8
c        8        4        1
e        8        5        2

In reply to the comment: "but "b" and "d" were for illustrative purposes, and I actually have about 100,000 IDs that I need to remove. So I want all those IDs listed in a separate file (referencefile.txt)"

If you have a file with the list of ids, like the one below, then:

To exclude the listed ids

$ cat id_lists
a
b

$ awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' id_lists infile
ID     Col1      Col2     Col3
c        8        4        1
d        3        5        9
e        8        5        2

To include only the listed ids

$ awk 'FNR==NR{ids[$1];next}FNR==1 || ($1 in ids)' id_lists infile
ID     Col1      Col2     Col3
a        4        2        8
b        5        6        1

Upvotes: 2
