Reputation: 2317
So I have made a web scraper with Python and some of its libraries. It goes to the given site and collects all links and their link texts from that site. I have filtered the results so that only external links are printed.
The code looks like this:
import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList
link = "http://www.ananda-pur.de/23.html"
newesturlDict = {}
baseAdrInsArray = []
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(link, timeout=10)
for linkins in br.links():
    newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
    linkTxt = linkins.text
    baseAdrIns = linkins.base_url
    if baseAdrIns not in baseAdrInsArray:
        baseAdrInsArray.append(baseAdrIns)
    netLocation = urlsplit(baseAdrIns)
    psl = PublicSuffixList()
    publicAddress = psl.get_public_suffix(netLocation.netloc)
    if publicAddress not in newesturl:
        if newesturl not in newesturlDict:
            newesturlDict[newesturl, linkTxt] = 1
        if newesturl in newesturlDict:
            newesturlDict[newesturl, linkTxt] += 1
newesturlCount = sorted(newesturlDict.items(), key=lambda (k, v): (v, k), reverse=True)
for newesturlC in newesturlCount:
    print baseAdrInsArray[0], " - ", newesturlC[0], "- count: ", newesturlC[1]
which prints results like this:
http://www.ananda-pur.de/23.html - ('http://www.yogibhajan.com/', 'http://www.yogibhajan.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kundalini-yoga-zentrum-berlin.de/', 'http://www.kundalini-yoga-zentrum-berlin.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.sat-nam-rasayan.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.kriteachings.org') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.gurudevsnr.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.3ho.de') - count: 1
My problem is with identical links that have different text. As the output above shows, the given site has 4 links to http://www.kriteachings.org/, but each of those 4 links has different text: the 1st is http://www.sat-nam-rasayan.de, the 2nd is http://www.kriteachings.org, the 3rd is http://www.gurudevsnr.com and the 4th is http://www.3ho.de.
I want output that shows how many times each link appears on the given page, and if the same link occurs with different texts, the texts should be appended together. For this example I would like to get output like this:
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - count: 1
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de - http://www.kundalini-yoga-zentrum-berlin.de - count: 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.sat-nam-rasayan.de, http://www.kriteachings.org, http://www.gurudevsnr.com, http://www.3ho.de - count: 4
explanation:
(the first link is the given page, the second is the found link, the third is actually the text of that found link, and the 4th item is how many times that link appears on the given site)
My main problem is that I don't know how to compare or sort the links, i.e. how to tell the program that two links are the same and that it should append the different texts.
Is something like that even possible without too much code? I'm a Python newbie so I'm a little bit lost.
Any help or advice is welcome.
Upvotes: 1
Views: 511
Reputation: 473983
Collect the links into a dictionary keyed by URL, gathering the link texts and keeping a count:
import cookielib
import mechanize
base_url = "http://www.ananda-pur.de/23.html"
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)
links = {}
for link in br.links():
    if link.url not in links:
        links[link.url] = {'count': 1, 'texts': [link.text]}
    else:
        links[link.url]['count'] += 1
        links[link.url]['texts'].append(link.text)

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
prints:
http://www.ananda-pur.de/23.html - index.html - Zadekstr 11,12351 Berlin, - 2
http://www.ananda-pur.de/23.html - 28.html - Das Team - 1
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - 1
http://www.ananda-pur.de/23.html - 24.html - Kontakt - 1
http://www.ananda-pur.de/23.html - 25.html - Impressum - 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.kriteachings.org,http://www.gurudevsnr.com,http://www.sat-nam-rasayan.de,http://www.3ho.de - 4
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de/ - http://www.kundalini-yoga-zentrum-berlin.de - 1
http://www.ananda-pur.de/23.html - 3.html - Ergo Oranien 155 - 1
http://www.ananda-pur.de/23.html - 2.html - Physio Bänsch 36 - 1
http://www.ananda-pur.de/23.html - 13.html - Stellenangebote - 1
http://www.ananda-pur.de/23.html - 23.html - Links - 1
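For reference, the same grouping idea written for Python 3 (cookielib and dict.iteritems above are Python 2-only): a collections.defaultdict accumulates the texts per URL, and the count is simply the length of each text list, sorted descending the way the question's original sorted(...) call did. This is a minimal sketch; the found_links list is hypothetical sample data standing in for mechanize's br.links().

```python
from collections import defaultdict

# Hypothetical (url, text) pairs standing in for br.links()
found_links = [
    ("http://www.kriteachings.org/", "http://www.sat-nam-rasayan.de"),
    ("http://www.yogibhajan.com/", "http://www.yogibhajan.com"),
    ("http://www.kriteachings.org/", "http://www.kriteachings.org"),
    ("http://www.kriteachings.org/", "http://www.gurudevsnr.com"),
    ("http://www.kriteachings.org/", "http://www.3ho.de"),
]

# Group texts by URL: the URL is the dictionary key,
# all texts seen for that URL accumulate in one list
texts_by_url = defaultdict(list)
for url, text in found_links:
    texts_by_url[url].append(text)

# Print sorted by occurrence count, highest first
for url, texts in sorted(texts_by_url.items(),
                         key=lambda kv: len(kv[1]), reverse=True):
    print("%s - %s - count: %d" % (url, ", ".join(texts), len(texts)))
```

Because duplicate URLs collapse into one key, no separate count field is needed; len() of the text list already is the count.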
Upvotes: 1