I have two scripts: one to download a webpage and another to extract the links from that webpage. They both run, but the links script doesn't print any links. Can anyone see or tell me why?
Webpage script:

import sys, urllib

def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page

def main():
    sys.argv.append('http://www.bbc.co.uk')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_get URL'
        return
    else:
        print getWebpage(sys.argv[1])

if __name__ == '__main__':
    main()
Links script:

import sys, urllib, re
import getWebpage

def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    for link in links:
        print link

def main():
    sys.argv.append('http://www.bbc.co.uk')
    if len(sys.argv) != 2:
        print '[-] Usage: webpage_links URL'
        return
    page = webpage_get.getWebpage(sys.argv[1])
    print_links(page)
This will fix most of your problems:
import sys, urllib, re

def getWebpage(url):
    print '[*] getWebpage()'
    url_file = urllib.urlopen(url)
    page = url_file.read()
    return page

def print_links(page):
    print '[*] print_links()'
    links = re.findall(r'\<a.*href\=.*http\:.+', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    for link in links:
        print link

def main():
    site = 'http://www.bbc.co.uk'
    page = getWebpage(site)
    print_links(page)

if __name__ == '__main__':
    main()
Then you can move on to fixing your regular expression.
While we are on the topic, though, I have two material recommendations: requests for getting web pages, and lxml for parsing the HTML rather than picking it apart with regular expressions.
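
A minimal sketch of that suggested approach, assuming both third-party packages are installed (pip install requests lxml); the get_links helper is just an illustration, and the URL is the one from the question:

import requests
import lxml.html

def get_links(url):
    response = requests.get(url)                  # fetch the page over HTTP
    doc = lxml.html.fromstring(response.content)  # parse the returned HTML
    doc.make_links_absolute(url)                  # resolve relative hrefs against the page URL
    # //a/@href selects the href attribute of every <a> element
    return sorted(set(doc.xpath('//a/@href')))

if __name__ == '__main__':
    for link in get_links('http://www.bbc.co.uk'):
        print link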
Your regular expression doesn't have an end, so once it matches the first <a ... href=...http: it keeps everything that follows: the trailing http\:.+ means "return everything from http: to the end of the line". You need to give the regular expression an explicit end, for example the closing quote of the href attribute.
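
For example, one possible tightening (an illustrative sketch, not the only fix) anchors the pattern at the quotes around the href value, so each match stops at the closing quote; it assumes double-quoted attributes, which real-world HTML does not guarantee:

import re

def print_links(page):
    print '[*] print_links()'
    # Capture only the URL inside href="...": the character class [^"]+ stops
    # at the closing quote instead of running on to the end of the line.
    links = re.findall(r'<a[^>]*href="(http[^"]+)"', page)
    links.sort()
    print '[+]', str(len(links)), 'HyperLinks Found:'
    for link in links:
        print link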