The Learner

Reputation: 265

A web crawler in Python: where should I start and what should I follow? - Help needed

I have intermediate knowledge of Python. If I have to write a web crawler in Python, what should I follow and where should I begin? Is there any specific tutorial? Any advice would be of much help. Thanks.

Upvotes: 6

Views: 4079

Answers (8)

Abdul Majid Sheike

Reputation: 121

import re
import urllib.request
from urllib.parse import urljoin

textfile = open('depth_1.txt', 'wt')
print("Enter the URL you wish to crawl..")
print('Usage - http://dynamichackerboys.blogspot.in')
myurl = input("@> ")

# Match the target of every href attribute on a page.
link_re = re.compile(r'''href=["']([^"']+)["']''', re.I)
# Print the links on the start page, then follow each one level deeper.
page = urllib.request.urlopen(myurl).read().decode('utf-8', 'ignore')
for i in link_re.findall(page):
    i = urljoin(myurl, i)  # resolve relative links against the start URL
    if not i.startswith('http'):
        continue  # skip mailto:, javascript:, and similar links
    print(i)
    subpage = urllib.request.urlopen(i).read().decode('utf-8', 'ignore')
    for ee in link_re.findall(subpage):
        print(ee)
        textfile.write(ee + '\n')
textfile.close()


For crawling the links on a website, one level deep.

Upvotes: -1

Ryan Ische

Reputation: 3676

IBM Developer Works has an article on this https://www.ibm.com/developerworks/linux/library/l-spider/#N101C6. You'll likely want to use the libraries that others have suggested, but this will give you an overall idea of the flow.
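
To make that flow concrete, here is a minimal sketch of the classic crawl loop the article describes: a queue of URLs to visit plus a set of URLs already seen. The start URL, the page limit, and the bare regex for links are illustrative assumptions, not anything the article prescribes.

import re
from collections import deque
from urllib.request import urlopen
from urllib.parse import urljoin

def crawl(start_url, max_pages=10):
    # Queue of pages still to fetch, and a set of URLs already queued.
    seen, queue = {start_url}, deque([start_url])
    while queue and max_pages > 0:
        url = queue.popleft()
        max_pages -= 1
        try:
            html = urlopen(url).read().decode('utf-8', 'ignore')
        except OSError:
            continue  # skip pages that fail to download
        print(url)
        # Extract links and enqueue any http(s) URL we have not seen yet.
        for href in re.findall(r'href=["\']([^"\']+)["\']', html):
            link = urljoin(url, href)
            if link.startswith('http') and link not in seen:
                seen.add(link)
                queue.append(link)

crawl('http://example.com')  # hypothetical start URL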

Upvotes: 2

marbdq

Reputation: 1235

It depends on your needs. For basic web scraping, mechanize + BeautifulSoup will do the job.
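
A minimal sketch of that combination, assuming mechanize and BeautifulSoup 4 (bs4) are installed; the URL is a placeholder:

import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # only do this if the site's policy allows it
html = br.open('http://example.com').read()  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])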

If you need JavaScript to be rendered, then I would go for Selenium or spynner. Both are great.
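
For the JavaScript case, a minimal Selenium sketch (this assumes a Firefox driver is available; the URL is a placeholder):

from selenium import webdriver

driver = webdriver.Firefox()  # requires the Firefox driver on your PATH
driver.get('http://example.com')  # placeholder URL
html = driver.page_source  # the DOM after JavaScript has run
driver.quit()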

Upvotes: 2

ratmatz

Reputation: 569

If you still want to write one from scratch, you'll want to use the mechanize module. It includes everything you need to simulate a browser and automate the fetching of URLs. I'll be redundant and also say BeautifulSoup for parsing any HTML you fetch. Otherwise, I'd go with Scrapy...

Upvotes: 5

Tim McNamara

Reputation: 18385

I strongly recommend taking a look at Scrapy. The library can work with BeautifulSoup or any HTML parser you prefer; I personally use it with lxml.html. A minimal spider sketch follows the list below.

Out of the box, you receive several things for free:

  • Concurrent requests, thanks to Twisted
  • CrawlSpider objects recursively look for links in the whole site
  • Great separation of data extraction & processing, which makes the most of the parallel processing capabilities
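
As a sketch of the CrawlSpider idea (the class name and start URL are illustrative, and this assumes a reasonably recent Scrapy):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = 'site'
    start_urls = ['http://example.com']  # placeholder start page
    # Follow every link found and hand each fetched page to parse_item.
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        # Data extraction stays separate from the crawling logic.
        yield {'url': response.url, 'title': response.css('title::text').get()}

Run it with scrapy runspider spider.py -o items.json and Scrapy handles the scheduling, concurrency, and output for you.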

Upvotes: 7

Giljed Jowes

Reputation: 729

Another good library you might need is one for parsing feeds. Once you have BeautifulSoup for HTML, you can use Feedparser for the feeds: http://www.feedparser.org/
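
Feedparser usage is about as small as it gets; the feed URL here is just a placeholder:

import feedparser

feed = feedparser.parse('http://example.com/atom.xml')  # placeholder feed URL
for entry in feed.entries:
    print(entry.title, entry.link)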

Upvotes: 1

Giljed Jowes

Reputation: 729

You will surely need an HTML parsing library. For this you can use BeautifulSoup. You can find lots of samples and tutorials for fetching URLs and processing the returned HTML on the official page: http://www.crummy.com/software/BeautifulSoup/
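
A minimal fetch-and-parse sketch with BeautifulSoup 4 (the URL is a placeholder):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://example.com').read()  # placeholder URL
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # the page <title>
for link in soup.find_all('a', href=True):
    print(link['href'])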

Upvotes: 5

gotgenes

Reputation: 40089

Why not look for existing code that already does what you need? If you need to build one yourself, it's still worth looking at existing code and deconstructing it to figure out how it works.

Upvotes: 3
