klpgaddis

Reputation: 1

Use a file with row numbers to delete rows in a separate file

I have a file (I'll call it recordsToRemove) containing row numbers that need to be removed from a separate, very large file. There are quite a few rows to remove (approximately 10,000), so typing each row number by hand is not really an option. I have searched quite a bit and have yet to find a Unix or similar solution for this.

Snippet of recordsToRemove:

5
7
9
13
18
26
28
29
30
36
...
596687
596688
596689
596690
596691
596697
596700
596706
596709
596716

The file that I need to remove rows from is a large, space-delimited file with 100,000+ columns:

10 -10 10 -10 10 10 10 10 10 -10 -10 0 0 -10 0 0 10 10 -10 -10 
10 0 0 0 0 0 0 10 10 -10 -10 10 -10 -10 10 -10 10 10 -10 -10 10 -10 -10 
10 10 -10 10 -10 -10 -10 10 10 -10 -10 0 0 -10 0 0 10 0 0 -10 
10 10 -10 10 -10 -10 -10 10 10 0 -10 0 0 -10 0 0 10 0 0 -10 
10 10 -10 10 -10 -10 -10 0 0 0 0 10 -10 10 10 -10 -10 10 10 10 
10 0 0 0 0 0 0 0 0 0 -10 10 -10 -10 10 -10 0 10 0 0 
0 10 -10 10 -10 -10 -10 10 10 -10 -10 10 -10 -10 10 -10 10 10 -10 -10 
10 -10 10 -10 10 10 10 0 0 0 -10 0 0 -10 0 0 0 10 0 0 
...

I would greatly appreciate any suggestions for how to accomplish this!

Upvotes: 0

Views: 61

Answers (2)

Etan Reisner

Reputation: 80931

You could try sticking a d on the end of each line of recordsToRemove and then run sed -f recordsToRemove largeFile.
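For example, one way to do that without editing recordsToRemove in place (a minimal sketch; remove.sed and Result.txt are hypothetical filenames):

sed 's/$/d/' recordsToRemove > remove.sed
sed -f remove.sed largeFile > Result.txt

The first command appends d to every line, turning 5 into the sed command 5d ("delete line 5"); the second runs all of those delete commands against largeFile.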

Upvotes: 1

Kent

Reputation: 195069

Just give this a try and see if it works:

awk 'NR==FNR{d[0+$0]=7;next}!d[FNR]' recordsToRemove bigFile > Result.txt

This saves the row numbers to be deleted in memory as hash table entries; 10k entries are nothing for your memory, I think.

Then each row number in the big file is checked against that hash table. A hash table lookup is O(1), so performance should not be a problem. Just give it a try on your real data.
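As a quick sanity check on the output (assuming each row number in recordsToRemove is unique and within range), the line counts should add up:

wc -l recordsToRemove bigFile Result.txt

Result.txt should contain exactly as many lines as bigFile minus the number of lines in recordsToRemove.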

Upvotes: 1
