Reputation: 13
I have a small project at home where I need to scrape a website for links every once in a while and save the links in a txt file.
The script needs to run on my Synology NAS, so it has to be written in bash or Python without any plugins or external libraries, as I can't install them on the NAS (to my knowledge, anyhow).
A link looks like this:
<a href="http://www.example.com">Example text</a>
I want to save the following to my text file:
Example text - http://www.example.com
I was thinking I could isolate the text with curl and some grep (or perhaps regex). First I looked into using Scrapy or BeautifulSoup, but couldn't find a way to install them on the NAS.
Could one of you help me put a script together?
Upvotes: 1
Views: 1378
Reputation: 98921
Based on your example, you need something like this:
wget -q -O- 'https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1' | sed -r 's#<a href="([^"]+)">([^<]+)</a>.*$#\2 - \1#' > links.txt
cat links.txt
outputs:
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
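If the page also contains lines without a link, a slightly stricter variant (just a sketch, assuming GNU or BSD sed with -E; same test URL) only prints the lines where the substitution actually matched:
wget -q -O- 'https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1' | sed -En 's#.*<a href="([^"]+)">([^<]+)</a>.*#\2 - \1#p' > links.txt
The -n flag suppresses sed's default output and the trailing p prints only matching lines, so stray markup doesn't end up in links.txt.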
Upvotes: 0
Reputation: 39365
You can use urllib2,
which ships with Python. Using it you can easily get the html of any url:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Now, about parsing the html: you can still use BeautifulSoup
without installing it. Their site says "You can also download the tarball and use BeautifulSoup.py in your project directly", so search the internet for that BeautifulSoup.py
file. If you can't find it, then download this one and save it as a local file inside your project. Then use it like below:
from BeautifulSoup import BeautifulSoup  # the local BeautifulSoup.py (BS3)

soup = BeautifulSoup(html)
for link in soup("a"):
    print link["href"]
    print link.renderContents()
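Putting the two pieces together to get the "Example text - URL" lines you asked for might look something like this (just a sketch: the URL and the links.txt filename are placeholders, and it assumes BeautifulSoup.py sits next to the script):
# Sketch: urllib2 plus the standalone BeautifulSoup.py, writing "text - url" lines.
import urllib2
from BeautifulSoup import BeautifulSoup  # the local BeautifulSoup.py (BS3)

html = urllib2.urlopen('http://www.example.com/').read()
soup = BeautifulSoup(html)

with open('links.txt', 'w') as out:
    for link in soup('a'):
        href = link.get('href')
        if href:  # skip anchors that have no href attribute
            out.write('%s - %s\n' % (link.renderContents().strip(), href))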
Upvotes: 2
Reputation: 2182
I recommend using Python's HTMLParser module. It parses the page and calls handler methods for each tag it encounters, so you can pick out the a href tags.
http://docs.python.org/2/library/htmlparser.html
There are lots of examples of using this module to find links, so I won't list all of the code, but here is a working example: Extract absolute links from a page using HTMLParser
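For reference, a rough sketch of the same idea written directly against HTMLParser (Python 2; the URL and the links.txt filename are just placeholders):
import urllib2
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects (text, href) pairs for every <a href="..."> tag.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> we are currently inside, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

html = urllib2.urlopen('http://www.example.com/').read()
parser = LinkCollector()
parser.feed(html)

with open('links.txt', 'w') as out:
    for text, href in parser.links:
        out.write('%s - %s\n' % (text, href))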
EDIT:
As Oday pointed out, you may not be able to load HTMLParser on the NAS. In that case, here are two recommendations for built-in modules that can do what you need:
htmllib is included in Python 2.X.
xml is included in Python 2.X and 3.X.
There is also a good explanation elsewhere on this site for how to use wget & grep to do the same thing:
Spider a Website and Return URLs Only
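As a rough sketch of that approach (assuming GNU or BSD grep with -o, links marked up exactly as in the question, and a placeholder URL), something like this would produce the same "text - url" lines:
wget -qO- 'http://www.example.com/' | grep -oE '<a href="[^"]+">[^<]+</a>' | sed -E 's#<a href="([^"]+)">([^<]+)</a>#\2 - \1#' > links.txt
grep -o pulls each link out onto its own line (even when several links share one line of html), and sed then rearranges it into the requested format.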
Upvotes: 0