Reputation: 19999
I'm currently working on a project where I need to send an email to a large number of email addresses, so I'm trying to avoid any "temporary" glitches such as service providers throttling the messages.
My plan is to take the initial list of email addresses and chop it up into smaller (chopped) lists, so that they can be scheduled in a staggered manner. Due to the sensitive nature of sending emails, I want to ensure that no duplicate email addresses exist across any of the chopped lists. Is there a way to do this via bash?
Side note: I am 100% certain that all email addresses in the master list are unique, due to the nature of the query used to compile it. I just want to make sure that the script which chopped the master list does not have a defect that creates duplicate email addresses across the chopped lists.
Upvotes: 1
Views: 895
Reputation: 84393
You need to sort unique addresses, and then split the ordered list into chunks.
Assuming your list files share a common naming pattern (such as emails_*.txt below) and hold one address per line, you can handle this with a short pipeline. sort accepts multiple file arguments (e.g. from a shell glob or xargs), so you can avoid the "useless use of cat." You then pipe the output into split, where you can control various aspects of the chunking. For example:
sort --unique emails_*.txt |
  split --numeric-suffixes \
        --lines=200 \
        --suffix-length=4 \
        --verbose
This splits the sorted, de-duplicated lines into chunks of up to 200 lines each and names each chunk with a numeric suffix suitable for batch processing. You can adjust the line count and suffix length to suit your requirements. With --verbose, split reports each chunk as it creates it:
creating file `x0000'
creating file `x0001'
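Once the chunks exist, you can also confirm that nothing was duplicated across them before scheduling the sends. A minimal sketch, assuming the chunks are the x0000-style files created above (the x[0-9]* glob is just an illustration):

# Total lines across all chunks vs. distinct lines; the counts should match if nothing was duplicated.
total=$(cat x[0-9]* | wc -l)
distinct=$(sort --unique x[0-9]* | wc -l)
if [ "$total" -eq "$distinct" ]; then
    echo "no duplicates across chunks"
else
    echo "duplicates detected across chunks" >&2
fi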
Upvotes: 1
Reputation: 161
Try
sort *.txt | sort -u -c
given that your filenames end with .txt. The first sort command orders all the email addresses. The second sort command checks that no two consecutive lines are equal and reports an error otherwise.
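Because sort -u -c reports its result through the exit status, the check is easy to script. A rough sketch, assuming the chopped lists end in .txt:

#!/bin/bash
# sort -u -c exits non-zero if it finds an out-of-order or duplicate line.
if sort *.txt | sort -u -c; then
    echo "all email addresses are unique"
else
    echo "duplicate email address detected" >&2
    exit 1
fi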
Upvotes: 2
Reputation: 2707
You can put the chopped files together (temporarily) via cat and use sort --unique to remove duplicates, then check whether the result has as many lines as the original file:
cat original_list | wc -l
and
cat list_part* | sort --unique | wc -l
If the results are the same, there are no duplicates.
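If you would rather not compare the numbers by eye, the same idea can be scripted. A sketch, assuming the master list is named original_list and the chunks match list_part*; comparing the raw chunk total against the de-duplicated count as well catches an address that was duplicated without any other address being dropped:

# Line counts: master list, all chunks combined, and chunks after de-duplication.
orig=$(wc -l < original_list)
total=$(cat list_part* | wc -l)
distinct=$(cat list_part* | sort --unique | wc -l)

if [ "$orig" -eq "$distinct" ] && [ "$total" -eq "$distinct" ]; then
    echo "no duplicates across the chopped lists"
else
    echo "mismatch: original=$orig chunk lines=$total distinct=$distinct" >&2
fi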
Upvotes: 2