Jim

Reputation: 1

How to extract an email address from multiple text files

I have approximately 96K text emails from which I want to extract the sender's address. I believe I can use domdoc for this, but I need someone to start me off. Can someone please advise whether there is a better way of doing this?

Thanks, Jim

Upvotes: 0

Views: 746

Answers (2)

matiasf

Reputation: 1118

I see no reason to do this in PHP. Provided the files are in some form of flat text, copy them to a directory (for example emails/), change into that directory, and run:

cat * | grep "From: " | egrep -oi '\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}' | sort | uniq > mail.list

Of course, if you have to do this in PHP, then:

  1. Copy the files/mails to a directory
  2. Get a list of the files with readdir()
  3. Read the file(s)
  4. Split the header from the body into a separate string
  5. Do a preg_match() on this string to find an email address and append it to $email_arr
  6. When finished, do array_unique() on the $email_arr.
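
The steps above could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for real mailboxes: the function names extract_sender() and collect_senders() are made up for illustration, the address pattern is simply the one from the shell command, and it assumes each file is a single plain-text message whose header block ends at the first blank line and whose From: header is not folded across lines. It needs PHP 7.1 or later for the nullable return type.

```php
<?php
// Hypothetical helper: return the first address found on the From: header
// line of one raw message, or null if there is none.
function extract_sender(string $raw): ?string
{
    // Step 4: the header block ends at the first blank line.
    $parts = preg_split('/\R\R/', $raw, 2);
    $header = $parts[0];

    // Step 5: find the From: line, then pick an address out of it.
    // The address pattern is copied from the shell command; note that
    // {2,4} will miss addresses with longer top-level domains.
    $pattern = '/\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}/i';
    if (preg_match('/^From:.*$/mi', $header, $line)
        && preg_match($pattern, $line[0], $match)) {
        return $match[0];
    }
    return null;
}

// Hypothetical helper: walk one directory and collect unique sender addresses.
function collect_senders(string $dir): array
{
    $email_arr = [];

    // Step 2: list the directory with opendir()/readdir().
    $handle = opendir($dir);
    if ($handle === false) {
        return $email_arr;
    }

    while (($entry = readdir($handle)) !== false) {
        $path = $dir . DIRECTORY_SEPARATOR . $entry;
        if (!is_file($path)) {
            continue; // skips ".", ".." and subdirectories
        }

        // Step 3: read the file.
        $raw = file_get_contents($path);
        if ($raw === false) {
            continue;
        }

        $email = extract_sender($raw);
        if ($email !== null) {
            $email_arr[] = $email;
        }
    }
    closedir($handle);

    // Step 6: drop duplicates and reindex the result.
    return array_values(array_unique($email_arr));
}
```

With the files copied to emails/ as in step 1, something like file_put_contents('mail.list', implode("\n", collect_senders('emails'))) would write the same kind of list as the shell command. array_unique() compares addresses case-sensitively, so Jim@example.com and jim@example.com would both be kept unless you lowercase them first.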

Upvotes: 2

Uprock7

Reputation: 1356

Using a regular expression in some form would be the best way to do it. If you can save your text emails to files, you can use something like Textpad to search for email addresses based on the regular expression.

You should be able to find regular expressions for email addresses online.

Upvotes: 0
