Reputation: 1
I have a file containing row numbers (I'll call it recordsToRemove) that need to be removed from a separate, very large file. There are quite a few rows to remove (approximately 10,000), so typing each row number to remove is not really an option. I have searched quite a bit and have yet to find a unix or similar solution to do this.
Snippet of recordsToRemove:
5
7
9
13
18
26
28
29
30
36
...
596687
596688
596689
596690
596691
596697
596700
596706
596709
596716
The file that I need to remove rows from is a large, space-delimited file with 100,000+ columns:
10 -10 10 -10 10 10 10 10 10 -10 -10 0 0 -10 0 0 10 10 -10 -10
10 0 0 0 0 0 0 10 10 -10 -10 10 -10 -10 10 -10 10 10 -10 -10 10 -10 -10
10 10 -10 10 -10 -10 -10 10 10 -10 -10 0 0 -10 0 0 10 0 0 -10
10 10 -10 10 -10 -10 -10 10 10 0 -10 0 0 -10 0 0 10 0 0 -10
10 10 -10 10 -10 -10 -10 0 0 0 0 10 -10 10 10 -10 -10 10 10 10
10 0 0 0 0 0 0 0 0 0 -10 10 -10 -10 10 -10 0 10 0 0
0 10 -10 10 -10 -10 -10 10 10 -10 -10 10 -10 -10 10 -10 10 10 -10 -10
10 -10 10 -10 10 10 10 0 0 0 -10 0 0 -10 0 0 0 10 0 0
...
I would greatly appreciate any suggestions on how to accomplish this!
Upvotes: 0
Views: 61
Reputation: 80931
You could try sticking a d on the end of each line of recordsToRemove and then running sed -f recordsToRemove largeFile.
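For example, a minimal sketch that keeps recordsToRemove intact by writing the generated commands to a separate file (remove.sed and cleaned.txt are just placeholder names):

sed 's/$/d/' recordsToRemove > remove.sed    # turn "5" into "5d", i.e. one sed delete command per row number
sed -f remove.sed largeFile > cleaned.txt    # apply all the delete commands in a single pass over the large file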
Upvotes: 1
Reputation: 195069
Just give this a try and see if it works:
awk 'NR==FNR{d[0+$0]=7;next}!d[FNR]' recordsToRemove bigFile > Result.txt
It saves the row numbers to be deleted in memory as hashtable entries; 10k entries are nothing for your memory. It then checks the row number of each line in the big file to see whether it exists in the hashtable. Since a hashtable lookup (the get() operation) is O(1), performance should not be a problem. Just give it a try on your real data.
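Spelled out with comments, the one-liner above does roughly this (same logic, just formatted for readability):

awk '
    NR==FNR {          # true only while reading the first file, recordsToRemove
        d[0+$0] = 7    # mark this row number in the hashtable (0+$0 forces a numeric key)
        next           # skip to the next input line; do not fall through to the filter below
    }
    !d[FNR]            # second file, bigFile: print the line only if its row number was not marked
' recordsToRemove bigFile > Result.txt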
Upvotes: 1