giogix

Reputation: 799

How to get all the URLs of a website using a crawler or a scraper?

I have to collect many URLs from a website and copy them into an Excel file, and I'm looking for an automatic way to do that. The website has a main page with about 300 links, and each linked page contains 2 or 3 links that interest me. Any suggestions?

Upvotes: 0

Views: 1553

Answers (4)

Abhishek

Reputation: 5728

You can use Beautiful Soup for parsing: http://www.crummy.com/software/BeautifulSoup/

The documentation is here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/

I wouldn't suggest Scrapy, because you don't need it for the work you described in your question.

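For example, this code uses urllib.request (urllib2 in Python 2) to open the Google homepage and returns all the links in the page as a list:

import urllib.request
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
data = urllib.request.urlopen('http://www.google.com').read()
soup = BeautifulSoup(data, 'html.parser')
# find_all('a') returns a list of every <a> tag in the document
print(soup.find_all('a'))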

For handling Excel files, take a look at http://www.python-excel.org

Upvotes: 0

MinimalMaximizer

Reputation: 392

If the links are in the HTML, you can use Beautiful Soup. This has worked for me in the past.

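from urllib.request import urlopen  # urllib2.urlopen in Python 2
from bs4 import BeautifulSoup

page = 'http://yourUrl.com'
opened = urlopen(page)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(opened, 'html.parser')

# Print the href attribute of every <a> tag (None if the tag has no href)
for link in soup.find_all('a'):
    print(link.get('href'))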
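
For the two-level layout described in the question (a main page of links, each leading to a page with a few interesting links), the same idea can be applied twice. The following is only a rough sketch: it swaps Beautiful Soup for the standard library's html.parser so it runs without installing anything, and the names LinkCollector, extract_links, fetch, crawl_two_levels and is_interesting are made up for illustration. It assumes the pages are UTF-8 and that the links are present in the static HTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url):
    """Return absolute, de-duplicated link URLs in document order."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        if absolute not in links:
            links.append(absolute)
    return links


def fetch(url):
    """Download a page as text (assumes UTF-8, replacing undecodable bytes)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl_two_levels(start_url, is_interesting, fetch_page=fetch):
    """Follow each link on start_url and keep the sub-links that
    is_interesting accepts, as (page_url, link) pairs."""
    results = []
    for page_url in extract_links(fetch_page(start_url), start_url):
        for link in extract_links(fetch_page(page_url), page_url):
            if is_interesting(link):
                results.append((page_url, link))
    return results
```

A call might look like crawl_two_levels('http://yourUrl.com', lambda url: url.endswith('.pdf')), where the lambda stands in for whatever makes a link interesting on your site. With about 300 pages to fetch, it is probably worth adding a short pause between requests.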

Upvotes: 1

Srinivasreddy Jakkireddy

Reputation: 2809

Have you tried Selenium or urllib? urllib is faster than Selenium. A simple website crawler built with Selenium is described here: http://useful-snippets.blogspot.in/2012/02/simple-website-crawler-with-selenium.html

Upvotes: 0

piokuc

Reputation: 26184

If you want to develop your solution in Python, then I can recommend the Scrapy framework.

As for inserting the data into an Excel sheet, there are ways to do it directly (see, for example, "Insert row into Excel spreadsheet using openpyxl in Python"), but you can also write the data to a CSV file and then import it into Excel.
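
The CSV route needs nothing beyond the standard library. As a minimal sketch, assuming the collected URLs are held as (source page, link) pairs; the function name write_urls_to_csv and the column headers are made up for illustration:

```python
import csv


def write_urls_to_csv(rows, path):
    """Write (source_page, link) pairs to a CSV file that Excel can import."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['source_page', 'link'])  # header row
        writer.writerows(rows)
```

For example, write_urls_to_csv([('http://yourUrl.com/page1', 'http://yourUrl.com/file.pdf')], 'urls.csv') produces a two-column file with one header row and one data row.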

Upvotes: 1
