Demon Labs

Reputation: 339

Extract URLs out of email in Python

Thanks for your submission to ourdirectory.com
URL: http://myurlok.us
Please click below link to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076

Once we receive your comfirmation, your site will be included for process!
regards,

http://www.ourdirectory.com

Thank you!

It should be obvious which URL I need to extract: the confirmation link (the confirm.aspx one).

Upvotes: 1

Views: 10827

Answers (6)

apocolyp4

Reputation: 36

This solution works only if the source is not HTML.

def extractURL(fileName):
    urlList = []

    # open the file containing the email; "with" closes it automatically
    with open(fileName) as file:
        for line in file:
            # split on any whitespace, which also drops the trailing newline
            for word in line.split():
                # a word is treated as a URL if it starts with an http(s) scheme;
                # this also keeps URLs that contain further colons (e.g. a port)
                if word.startswith(("http://", "https://")):
                    urlList.append(word)

    return urlList

Upvotes: 1

gixxer

Reputation: 824

Check this out.

I wrote a post on this. The code in that post extracts URLs from an email file whether the content type is plain text or HTML, and whether the transfer encoding is quoted-printable, base64, or 7bit.

Python - How to extract URLs (plain/html, quote-printable/base64/7bit) from an email file
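To give a rough idea of what such code has to do, here is a minimal sketch using the standard-library `email` package. It is not the code from the linked post: the function name `extract_urls_from_email` and the deliberately simple URL pattern are assumptions made for illustration. The key step is `get_payload(decode=True)`, which undoes the quoted-printable or base64 transfer encoding before the text is searched.

```python
import email
import re

# Deliberately simple pattern (an assumption for this sketch): the scheme,
# then everything up to whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_urls_from_email(raw_message):
    """Return URLs found in the text/plain and text/html parts of a message."""
    msg = email.message_from_string(raw_message)
    urls = []
    for part in msg.walk():
        # Skip multipart containers and non-text attachments.
        if part.get_content_type() not in ("text/plain", "text/html"):
            continue
        # decode=True removes the transfer encoding and returns bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Fall back to UTF-8 when the part declares no charset.
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        urls.extend(URL_RE.findall(text))
    return urls
```

For HTML parts this simply pulls URLs out of the markup text, so a link that appears both as an `href` and as visible text is returned twice. A part that declares a charset Python does not know would raise `LookupError`, which this sketch does not handle.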

Upvotes: 0

ghostdog74

Reputation: 342629

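@OP, if your email always has this layout, with the URL on the line right after the prompt:

with open("emailfile") as f:
    for line in f:
        if "confirm your submission" in line:
            # the URL is on the next line; "" avoids an error at end of file
            print(next(f, "").strip())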

Upvotes: 1

Brienne Schroth

Reputation: 2457

regex:

"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$"

or without regex, parse the email line by line and test if the string contains "http://www.ourdirectory.com/confirm.aspx?id=" and if it does, that's your URL.

Of course, if your input is actually the HTML source instead of the text you posted, this all goes out the window.
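Both approaches can be sketched in Python roughly as follows. The function names `find_confirm_url_regex` and `find_confirm_url_plain` are made up for this example, and the input is assumed to be the plain-text body, not HTML. `re.MULTILINE` is used so that `$` matches at the end of each line and not only at the end of the whole message.

```python
import re

PREFIX = "http://www.ourdirectory.com/confirm.aspx?id="
CONFIRM_RE = re.compile(
    r"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$",
    re.MULTILINE,
)

def find_confirm_url_regex(text):
    """Regex variant: return the first confirmation URL, or None."""
    match = CONFIRM_RE.search(text)
    return match.group(0) if match else None

def find_confirm_url_plain(text):
    """Substring variant: scan line by line for the known prefix."""
    for line in text.splitlines():
        start = line.find(PREFIX)
        if start != -1:
            # Take everything from the prefix up to the next whitespace.
            return line[start:].split()[0]
    return None
```

The regex variant requires the digits to run to the end of the line, so trailing spaces after the URL would make it miss; the substring variant is more forgiving there.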

Upvotes: 0

Matt Garrison

Reputation: 539

If it's an HTML email with hyperlinks, you could use the standard library's HTMLParser class as a shortcut.

from html.parser import HTMLParser

class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # for every <a> tag, print its href and then the full tag text
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)
                    print(self.get_starttag_text())

someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)

Upvotes: 2

Tim Pietzcker

Reputation: 336378

Not easy. One suggestion (taken from the RegexBuddy library):

\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])

This will match URLs (but not mailto: links; if you want those, say so), even if they are enclosed in parentheses. It will also match URLs without http:// or ftp:// etc. if they start with www. or ftp.

A simpler version:

\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]

It all depends on what your needs are/what your input looks like.
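As a minimal sketch of how the simpler pattern might be used from Python: the character classes only list uppercase letters, so the pattern is compiled with `re.IGNORECASE` here. The variable names are illustrative, and the sample text is based on the email in the question.

```python
import re

# The simpler pattern from above; IGNORECASE is needed because the
# character classes only contain A-Z.
SIMPLE_URL_RE = re.compile(
    r"\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

email_text = (
    "Thanks for your submission to ourdirectory.com URL: http://myurlok.us "
    "Please click below link to confirm your submission. "
    "http://www.ourdirectory.com/confirm.aspx?id=1247778154270076"
)

# All URLs in the text, in order of appearance.
urls = SIMPLE_URL_RE.findall(email_text)

# Narrow the list down to the confirmation link.
confirm_urls = [u for u in urls if "confirm.aspx" in u]
```

Matching every URL first and filtering afterwards keeps the generic pattern reusable; the filter is the only part specific to this email.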

Upvotes: 0
