Reputation: 339
Thanks for your submission to ourdirectory.com
URL: http://myurlok.us
Please click below link to confirm your submission.
http://www.ourdirectory.com/confirm.aspx?id=1247778154270076
Once we receive your comfirmation, your site will be included for process!
regards,
http://www.ourdirectory.com
Thank you!
Should be obvious which URL I need to extract.
Upvotes: 1
Views: 10827
Reputation: 36
This solution works only if the source is not HTML.
def extractURL(self, fileName):
    urlList = []
    # Open the file containing the email; "with" closes it even on error
    with open(fileName) as file:
        for line in file:
            # split() with no argument splits on any whitespace,
            # so the last word of a line does not keep its newline
            for word in line.split():
                # Check the scheme prefix directly, so URLs that contain
                # further colons (e.g. a port number) are not skipped
                if word.startswith(("http://", "https://")):
                    urlList.append(word)
    return urlList
Upvotes: 1
Reputation: 824
I wrote a post about this. The code in it extracts URLs from an email file whether the content type is plain text or HTML, and whether the transfer encoding is quoted-printable, base64, or 7bit.
Python - How to extract URLs (plain/html, quote-printable/base64/7bit) from an email file
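The code from the linked post is not reproduced here. As a rough sketch of the same idea, the standard-library email package can undo the transfer encoding before any URL matching is done. This is written for Python 3; the function name `extract_urls` and the deliberately simple pattern `URL_RE` are assumptions made for illustration, not taken from the post.

```python
import re
from email import message_from_string, policy

# Hypothetical, deliberately simple pattern (an assumption for this sketch)
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def extract_urls(raw_email):
    """Return the URLs found in the text parts of a raw email string."""
    msg = message_from_string(raw_email, policy=policy.default)
    urls = []
    for part in msg.walk():
        # Skip multipart containers and attachments; keep text/plain and text/html
        if part.get_content_maintype() != "text":
            continue
        # get_content() undoes base64 / quoted-printable and decodes the charset
        urls.extend(URL_RE.findall(part.get_content()))
    return urls
```

For HTML parts this only finds URLs that appear literally in the markup; it does not parse the tags.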
Upvotes: 0
Reputation: 342629
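@OP, if your email always has this layout, the URL is on the line right after the "confirm your submission" line:
# Print the line that follows the "confirm your submission" line
with open("emailfile") as f:
    for line in f:
        if "confirm your submission" in line:
            print(next(f, "").strip())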
Upvotes: 1
Reputation: 2457
regex:
"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$"
Or, without regex, parse the email line by line and test whether the line contains "http://www.ourdirectory.com/confirm.aspx?id="; if it does, that's your URL.
Of course, if your input is actually the HTML source instead of the text you posted, this all goes out the window.
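Both approaches might look roughly like this in Python 3. The function names and the `PREFIX` constant are made up for the sketch, and it assumes the URL is the last thing on its line, as in the email text posted in the question.

```python
import re

# Dots are escaped so they match literally; re.MULTILINE lets $ match
# at the end of every line rather than only at the end of the text.
CONFIRM_RE = re.compile(
    r"http://www\.ourdirectory\.com/confirm\.aspx\?id=[0-9]+$", re.MULTILINE
)
PREFIX = "http://www.ourdirectory.com/confirm.aspx?id="


def find_with_regex(text):
    match = CONFIRM_RE.search(text)
    return match.group(0) if match else None


def find_without_regex(text):
    for line in text.splitlines():
        start = line.find(PREFIX)
        if start != -1:
            # Take everything from the prefix up to the next whitespace
            return line[start:].split()[0]
    return None
```

Note that the regex version will not match if the line ends in "\r\n", because the "\r" sits between the digits and the end of the line.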
Upvotes: 0
Reputation: 539
If it's an HTML email with hyperlinks, you could use the HTMLParser module from the standard library as a shortcut.
# Python 3: the module is html.parser (it was HTMLParser in Python 2)
from html.parser import HTMLParser

class parseLinks(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the hyperlinks we are after
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    print(value)
                    print(self.get_starttag_text())

someHtmlContainingLinks = ""
linkParser = parseLinks()
linkParser.feed(someHtmlContainingLinks)
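If you would rather collect the links than print them, a small variation on the same idea could look like this. It is written for Python 3, where the module is `html.parser`; the class name `LinkCollector` and the sample HTML are invented for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None for a bare attribute such as <a href>
                if name == "href" and value:
                    self.links.append(value)


# Made-up HTML body wrapping the confirmation link from the question
sample_html = (
    '<p>Please <a href="http://www.ourdirectory.com/confirm.aspx'
    '?id=1247778154270076">confirm your submission</a>.</p>'
)
collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```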
Upvotes: 2
Reputation: 336378
Not easy. One suggestion (taken from the RegexBuddy library):
\b(?:(?:https?|ftp|file)://|www\.|ftp\.)(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[-A-Z0-9+&@#/%=~_|$?!:,.])*(?:\([-A-Z0-9+&@#/%=~_|$?!:,.]*\)|[A-Z0-9+&@#/%=~_|$])
will match URLs (without mailto:; if you want that, say so), even if they are enclosed in parentheses. It will also match URLs without http:// or ftp:// etc. if they start with www. or ftp.
A simpler version:
\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]
It all depends on what your needs are/what your input looks like.
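As a quick sketch of how the simpler version could be used from Python: the character classes list only A-Z, so the pattern needs case-insensitive matching to find lowercase URLs. The variable names are illustrative, and the sample text is the first part of the email from the question.

```python
import re

# The simpler pattern from above, compiled with re.IGNORECASE because
# its character classes only list uppercase letters.
SIMPLE_URL_RE = re.compile(
    r"\bhttps?://[-A-Z0-9+&@#/%?=~_|$!:,.;]*[A-Z0-9+&@#/%=~_|$]",
    re.IGNORECASE,
)

text = (
    "URL: http://myurlok.us Please click below link to confirm your submission.\n"
    "http://www.ourdirectory.com/confirm.aspx?id=1247778154270076\n"
)

# findall returns whole matches here because the pattern has no groups
urls = SIMPLE_URL_RE.findall(text)
print(urls)
```

This finds every URL in the text, so you still have to pick out the confirmation link yourself, for example by checking for "confirm.aspx".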
Upvotes: 0