designarti

Reputation: 629

macOS terminal solution to remove from one text file the lines found in another text file

I work in SEO and sometimes I have to manage lists of domains to be considered for certain actions in our campaigns. On my iMac I have two lists: one provided for consideration - unfiltered.txt - and another listing the domains I've already analyzed - used.txt. The new one provided for consideration, unfiltered.txt, looks like this:

site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz
... etc

The list of domains to be used as a filter, i.e. to be eliminated (used.txt), looks like this:

site4.org
site5.me
site6.co.nz
gland.org.uk
kland.co.nz
site7.de
site8.it
... etc

Is there a way to use my OS X terminal to remove from unfiltered.txt all the lines found in used.txt? I found a software solution that only partially solves the problem: besides the exact entries from used.txt, it also eliminates domains that merely contain those entries as substrings. That makes the filter broader than intended and removes domains I still need.

For example, if my unfiltered.txt contains a domain named fogland.org.uk it will be automatically eliminated if in my used.txt file I have a domain named gland.org.uk.

The files are pretty big (close to 100k lines). I have a pretty good configuration (SSD, 7th-gen i7, 16 GB RAM), but I'd rather not let this operation run for hours.

... hope it makes sense.

TIA

Upvotes: 1

Views: 111

Answers (4)

mauro

Reputation: 5950

You can use comm and process substitution to do everything in one line:

comm -23 <(sort unfiltered.txt) <(sort used.txt) > filtered.txt

P.S. tested on my Mac running OSX 10.11.6 (El Capitan)

Upvotes: 0

LSerni

Reputation: 57418

I have always used

grep -v -x -F -f expunge.txt filewith.txt > filewithout.txt

to do this (-x makes grep match whole lines only, which avoids the substring over-matching described in the question). When "expunge.txt" is too large, you can do it in stages, cutting it into manageable chunks and filtering one after another:

cp filewith.txt original.txt

and loop as required:
    grep -v -x -F -f chunkNNN.txt filewith.txt > filewithout.txt
    mv filewithout.txt filewith.txt

You could even do this in a pipe (only the first grep reads a file; each later stage filters the stream coming from the previous one):

 grep -v -x -F -f chunk01.txt original.txt |
 grep -v -x -F -f chunk02.txt |
 grep -v -x -F -f chunk03.txt \
 > purged.txt
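The chunked approach can be scripted end to end. A sketch with tiny made-up demo files (the file names, chunk prefix, and one-line chunk size are chosen here only for illustration; tens of thousands of lines per chunk makes more sense for a 100k-line filter list):

```shell
# Demo data: a list to clean and a filter list (stand-ins for the real files)
printf 'site1.com\nfogland.org.uk\nsite4.org\n' > filewith.txt
printf 'site4.org\ngland.org.uk\n' > expunge.txt

# Split the filter list into chunks named chunk_aa, chunk_ab, ...
split -l 1 expunge.txt chunk_

cp filewith.txt work.txt
for c in chunk_*; do
    # -x matches whole lines only, so gland.org.uk cannot remove fogland.org.uk
    grep -v -x -F -f "$c" work.txt > tmp.txt
    mv tmp.txt work.txt
done
mv work.txt filewithout.txt
cat filewithout.txt
# prints:
#   site1.com
#   fogland.org.uk
```

One caveat: grep exits with status 1 when a chunk filters out every remaining line, which is harmless here but would stop a script running under set -e.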

Upvotes: 1

Mark Setchell

Reputation: 207540

You can do that with awk by passing it both files. While reading the first file - where FNR, the record number within the current file, equals NR, the record number across all files - you make a note of each domain you have seen. Then, while reading the second file, you print only the records you did not see in the first file:

awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt 

Sample Output for your input data

site1.com
site2.com
domain3.net
british.co.uk
england.org.uk
auckland.co.nz

awk is included and delivered as part of macOS - no need to install anything.
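Because awk compares whole records rather than substrings, this also handles the gland.org.uk / fogland.org.uk case from the question. A quick check with made-up two-line demo files:

```shell
printf 'fogland.org.uk\nsite1.com\n' > unfiltered.txt
printf 'gland.org.uk\nsite1.com\n' > used.txt

awk 'FNR==NR{seen[$0]++;next} !seen[$0]' used.txt unfiltered.txt
# prints fogland.org.uk - the exact match site1.com is removed,
# but fogland.org.uk survives despite containing gland.org.uk
```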

Upvotes: 1

user133831

Reputation: 670

You can use comm. I haven't got a Mac here to check, but I expect it will be installed by default. Note that both files must be sorted first. Then try:

comm -2 -3 unfiltered.txt used.txt

Check the man page for further details.
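Since the question's files are not necessarily sorted, the sort step can be made explicit. A sketch with made-up demo data (the .sorted file names are chosen here only for illustration):

```shell
printf 'site1.com\nfogland.org.uk\nsite4.org\n' > unfiltered.txt
printf 'site4.org\ngland.org.uk\n' > used.txt

# comm requires sorted input
sort unfiltered.txt > unfiltered.sorted
sort used.txt > used.sorted

# -2 suppresses lines unique to used.sorted, -3 suppresses lines common
# to both, leaving only lines unique to unfiltered.sorted
comm -2 -3 unfiltered.sorted used.sorted
# prints:
#   fogland.org.uk
#   site1.com
```

Note that the output comes out in sorted order, not in the original order of unfiltered.txt; the awk answer preserves the original order if that matters.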

Upvotes: 0
