Reputation: 57
I know that there are a few similar questions to this that have previously been answered, but I haven't managed to find exactly what I want (and have tried variants of proposed solutions). Hopefully this is an easy question.
I have a tab-separated file (file.txt) with 10 columns and about half a million lines, which in simplified form looks like this:
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
c 8 4 1
d 3 5 9
e 8 5 2
I'd like to remove all the lines where, say, "b" and "d" appear in the first (ID) column. The output that I want is:
ID Col1 Col2 Col3
a 4 2 8
c 8 4 1
e 8 5 2
It is important that the order of the IDs is maintained in my output file.
In reality, there are about 100,000 lines that I want to remove. I therefore have a reference file (referencefile.txt) that lists all the IDs that I want removed from file.txt. In this example, the reference file would simply contain "b" and "d" on successive lines.
I am using grep at the moment, and while it works, it is proving painfully slow:
grep -v -f referencefile.txt file.txt
Is there a way of using awk or sed (or anything else for that matter) to speed up the process?
Many thanks.
AB
Upvotes: 0
Views: 68
Reputation: 5665
There are ways of speeding up grep itself. I'd suggest:
-F: treat the patterns read from -f referencefile.txt as fixed strings and not regexes.
-w: match whole words only.
Possibly LC_ALL=C: use the LC_ALL environment variable to instruct grep to use ASCII rather than UTF-8.
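Putting those suggestions together, a minimal sketch might look like the following. The sample data is rebuilt from the question, and the output name filtered.txt is only an assumption for illustration. Note that -w matches a whole word anywhere on the line, not just in the ID column, so this is only safe if the IDs cannot also appear as values in other columns.

```shell
# Rebuild the sample data from the question (tab-separated).
printf 'ID\tCol1\tCol2\tCol3\na\t4\t2\t8\nb\t5\t6\t1\nc\t8\t4\t1\nd\t3\t5\t9\ne\t8\t5\t2\n' > file.txt
printf 'b\nd\n' > referencefile.txt

# -v: print non-matching lines   -f: read patterns from a file
# -F: patterns are fixed strings -w: patterns must match whole words
# LC_ALL=C makes grep compare bytes instead of handling multibyte characters.
# filtered.txt is a hypothetical output name, not from the question.
LC_ALL=C grep -v -w -F -f referencefile.txt file.txt > filtered.txt
cat filtered.txt
```

Because grep only filters lines, the order of the remaining IDs is unchanged.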
Upvotes: 0
Reputation: 16997
Using awk:
awk 'FNR>1 && ($1 == "b" || $1 == "d"){ next } 1' infile
# OR
awk 'FNR>1 && $1 ~ /^([bd])$/{ next } 1' infile
# To exclude the lines of infile whose first field
# is one of the ids listed in id_lists
awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' id_lists infile
# To include only the lines of infile whose first field
# is one of the ids listed in id_lists
awk 'FNR==NR{ids[$1];next}FNR==1 || ($1 in ids)' id_lists infile
Test Results:
Input
$ cat infile
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
c 8 4 1
d 3 5 9
e 8 5 2
Output
$ awk 'FNR>1 && $1 ~ /^([bd])$/{ next } 1' infile
ID Col1 Col2 Col3
a 4 2 8
c 8 4 1
e 8 5 2
$ awk 'FNR>1 && ($1 == "b" || $1 == "d"){ next } 1' infile
ID Col1 Col2 Col3
a 4 2 8
c 8 4 1
e 8 5 2
but "b" and "d" were for illustrative purposes, and I actually have about 100,000 IDs that I need to remove. So I want all those IDs listed in a separate file (referencefile.txt).
If you have a file with a list of ids, like the one below, then:
To exclude the listed ids
$ cat id_lists
a
b
$ awk 'FNR==NR{ids[$1];next}FNR>1 && ($1 in ids){next}1' id_lists infile
ID Col1 Col2 Col3
c 8 4 1
d 3 5 9
e 8 5 2
To include only the listed ids
$ awk 'FNR==NR{ids[$1];next}FNR==1 || ($1 in ids)' id_lists infile
ID Col1 Col2 Col3
a 4 2 8
b 5 6 1
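Applied to the file names from the question, the exclude case can be sketched as below. This is an annotated variant of the same two-file idiom, not a different method. The -F'\t' setting and the output name output.txt are assumptions added here (the question says the file is tab-separated), and the sketch assumes referencefile.txt is not empty, since the NR == FNR test relies on the first file containing at least one line.

```shell
# Rebuild the sample data from the question (tab-separated).
printf 'ID\tCol1\tCol2\tCol3\na\t4\t2\t8\nb\t5\t6\t1\nc\t8\t4\t1\nd\t3\t5\t9\ne\t8\t5\t2\n' > file.txt
printf 'b\nd\n' > referencefile.txt

awk -F'\t' '
    # First file only: NR (total line count) equals FNR (per-file line count)
    # while awk is still reading referencefile.txt, so store each ID as an array key.
    NR == FNR { ids[$1]; next }
    # Second file: always print the header, then print rows whose ID was not stored.
    FNR == 1 || !($1 in ids)
' referencefile.txt file.txt > output.txt
cat output.txt
```

Each line of file.txt costs one array lookup, and the lines are printed in the order they are read, so the original ID order is kept.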
Upvotes: 2