Reputation: 2317
So I have made a web scraper with Python and some of its libraries. It goes to the given site and collects all links and their link texts from that site. I have filtered the results so that only external links are printed.
The code looks like this:
import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList
link = "http://www.ananda-pur.de/23.html"
newesturlDict = {}
baseAdrInsArray = []
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(link, timeout=10)
for linkins in br.links():
    newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
    linkTxt = linkins.text
    baseAdrIns = linkins.base_url
    if baseAdrIns not in baseAdrInsArray:
        baseAdrInsArray.append(baseAdrIns)
    netLocation = urlsplit(baseAdrIns)
    psl = PublicSuffixList()
    publicAddress = psl.get_public_suffix(netLocation.netloc)
    if publicAddress not in newesturl:
        if newesturl not in newesturlDict:
            newesturlDict[newesturl, linkTxt] = 1
        if newesturl in newesturlDict:
            newesturlDict[newesturl, linkTxt] += 1
newesturlCount = sorted(newesturlDict.items(), key=lambda (k, v): (v, k), reverse=True)
for newesturlC in newesturlCount:
    print baseAdrInsArray[0], " - ", newesturlC[0], "- count: ", newesturlC[1]
which prints results like this:
http://www.ananda-pur.de/23.html - ('http://www.yogibhajan.com/', 'http://www.yogibhajan.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kundalini-yoga-zentrum-berlin.de/', 'http://www.kundalini-yoga-zentrum-berlin.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.sat-nam-rasayan.de') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.kriteachings.org') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.gurudevsnr.com') - count: 1
http://www.ananda-pur.de/23.html - ('http://www.kriteachings.org/', 'http://www.3ho.de') - count: 1
My problem is with identical links that have different text. As the output above shows, the given site has 4 links to http://www.kriteachings.org/, but each of those 4 links has different text: the 1st is http://www.sat-nam-rasayan.de, the 2nd is http://www.kriteachings.org, the 3rd is http://www.gurudevsnr.com and the 4th is http://www.3ho.de.
I want output that shows how many times each link appears on the given page, and if the same link occurs with different texts, the texts should be appended together. For this example I would like to get output like this:
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - count: 1
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de - http://www.kundalini-yoga-zentrum-berlin.de - count: 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.sat-nam-rasayan.de, http://www.kriteachings.org, http://www.gurudevsnr.com, http://www.3ho.de - count: 4
explanation:
(the first link is the given page, the second is the found link, the third is actually the text of that found link, and the 4th item is how many times that link appears on the given site)
My main problem is that I don't know how to compare or sort the links, i.e. how to tell the program that two links are the same and that it should append the different texts.
Is something like that even possible without too much code? I'm a Python newbie so I'm a little bit lost.
Any help or advice is welcome.
Upvotes: 1
Views: 511
Reputation: 473983
Collect the links into a dictionary keyed by URL, gathering the link texts and keeping a count:
import cookielib
import mechanize
base_url = "http://www.ananda-pur.de/23.html"
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(base_url, timeout=10)
links = {}
for link in br.links():
    if link.url not in links:
        links[link.url] = {'count': 1, 'texts': [link.text]}
    else:
        links[link.url]['count'] += 1
        links[link.url]['texts'].append(link.text)

# printing
for link, data in links.iteritems():
    print "%s - %s - %s - %d" % (base_url, link, ",".join(data['texts']), data['count'])
prints:
http://www.ananda-pur.de/23.html - index.html - Zadekstr 11,12351 Berlin, - 2
http://www.ananda-pur.de/23.html - 28.html - Das Team - 1
http://www.ananda-pur.de/23.html - http://www.yogibhajan.com/ - http://www.yogibhajan.com - 1
http://www.ananda-pur.de/23.html - 24.html - Kontakt - 1
http://www.ananda-pur.de/23.html - 25.html - Impressum - 1
http://www.ananda-pur.de/23.html - http://www.kriteachings.org/ - http://www.kriteachings.org,http://www.gurudevsnr.com,http://www.sat-nam-rasayan.de,http://www.3ho.de - 4
http://www.ananda-pur.de/23.html - http://www.kundalini-yoga-zentrum-berlin.de/ - http://www.kundalini-yoga-zentrum-berlin.de - 1
http://www.ananda-pur.de/23.html - 3.html - Ergo Oranien 155 - 1
http://www.ananda-pur.de/23.html - 2.html - Physio Bänsch 36 - 1
http://www.ananda-pur.de/23.html - 13.html - Stellenangebote - 1
http://www.ananda-pur.de/23.html - 23.html - Links - 1
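For reference, the same grouping idea written for Python 3 (cookielib and dict.iteritems above are Python 2-only): a collections.defaultdict accumulates the texts per URL, and the count is simply the length of each text list, sorted descending the way the question's original sorted(...) call did. This is a minimal sketch; the found_links list is hypothetical sample data standing in for mechanize's br.links().

```python
from collections import defaultdict

# Hypothetical (url, text) pairs standing in for br.links()
found_links = [
    ("http://www.kriteachings.org/", "http://www.sat-nam-rasayan.de"),
    ("http://www.yogibhajan.com/", "http://www.yogibhajan.com"),
    ("http://www.kriteachings.org/", "http://www.kriteachings.org"),
    ("http://www.kriteachings.org/", "http://www.gurudevsnr.com"),
    ("http://www.kriteachings.org/", "http://www.3ho.de"),
]

# Group texts by URL: the URL is the dictionary key,
# all texts seen for that URL accumulate in one list
texts_by_url = defaultdict(list)
for url, text in found_links:
    texts_by_url[url].append(text)

# Print sorted by occurrence count, highest first
for url, texts in sorted(texts_by_url.items(),
                         key=lambda kv: len(kv[1]), reverse=True):
    print("%s - %s - count: %d" % (url, ", ".join(texts), len(texts)))
```

Because duplicate URLs collapse into one key, no separate count field is needed; len() of the text list already is the count.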
Upvotes: 1