Reputation: 13
I have a small project at home where I need to scrape a website for links every once in a while and save the links in a txt file.
The script needs to run on my Synology NAS, so it has to be written in bash or Python without any plugins or external libraries, as I can't install them on the NAS (to my knowledge, anyhow).
A link looks like this:
<a href="http://www.example.com">Example text</a>
I want to save the following to my text file:
Example text - http://www.example.com
I was thinking I could isolate the text with curl and some grep (or perhaps regex). First I looked into using Scrapy or BeautifulSoup, but couldn't find a way to install them on the NAS.
Could one of you help me put a script together?
Upvotes: 1
Views: 1378
Reputation: 98921
Based on your example, you need something like this:
wget -q -O- 'https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1' | sed -r 's#<a href="([^"]+)">([^<]+)</a>.*$#\2 - \1#' > links.txt
cat links.txt
outputs:
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
Visit W3Schools - http://www.w3schools.com/
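If the page also contains lines without a link, a slightly stricter variant (just a sketch, assuming GNU or BSD sed with -E; same test URL) only prints the lines where the substitution actually matched:
wget -q -O- 'https://dl.dropboxusercontent.com/s/wm6mt2ew0nnqdu6/links.html?dl=1' | sed -En 's#.*<a href="([^"]+)">([^<]+)</a>.*#\2 - \1#p' > links.txt
The -n flag suppresses sed's default output and the trailing p prints only matching lines, so stray markup doesn't end up in links.txt.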
Upvotes: 0
Reputation: 39365
You can use urllib2,
which ships with Python. Using it you can easily get the html of any url:
import urllib2
response = urllib2.urlopen('http://python.org/')
html = response.read()
Now, about parsing the html: you can still use BeautifulSoup
without installing it. Their site says "You can also download the tarball and use BeautifulSoup.py in your project directly", so search the internet for that BeautifulSoup.py
file. If you can't find it, then download this one and save it as a local file inside your project. Then use it like below:
from BeautifulSoup import BeautifulSoup  # the local BeautifulSoup.py (BS3)

soup = BeautifulSoup(html)
for link in soup("a"):
    print link["href"]
    print link.renderContents()
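Putting the two pieces together to get the "Example text - URL" lines you asked for might look something like this (just a sketch: the URL and the links.txt filename are placeholders, and it assumes BeautifulSoup.py sits next to the script):
# Sketch: urllib2 plus the standalone BeautifulSoup.py, writing "text - url" lines.
import urllib2
from BeautifulSoup import BeautifulSoup  # the local BeautifulSoup.py (BS3)

html = urllib2.urlopen('http://www.example.com/').read()
soup = BeautifulSoup(html)

with open('links.txt', 'w') as out:
    for link in soup('a'):
        href = link.get('href')
        if href:  # skip anchors that have no href attribute
            out.write('%s - %s\n' % (link.renderContents().strip(), href))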
Upvotes: 2
Reputation: 2182
I recommend using Python's HTMLParser module. It parses the page and calls handler methods for each tag it encounters, so you can pick out the a href tags.
http://docs.python.org/2/library/htmlparser.html
There are lots of examples of using this module to find links, so I won't list all of the code, but here is a working example: Extract absolute links from a page using HTMLParser
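For reference, a rough sketch of the same idea written directly against HTMLParser (Python 2; the URL and the links.txt filename are just placeholders):
import urllib2
from HTMLParser import HTMLParser

class LinkCollector(HTMLParser):
    # Collects (text, href) pairs for every <a href="..."> tag.
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the <a> we are currently inside, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None

html = urllib2.urlopen('http://www.example.com/').read()
parser = LinkCollector()
parser.feed(html)

with open('links.txt', 'w') as out:
    for text, href in parser.links:
        out.write('%s - %s\n' % (text, href))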
EDIT:
As Oday pointed out, you may not be able to load HTMLParser on the NAS. In that case, here are two recommendations for built-in modules that can do what you need:
htmllib is included in Python 2.X.
xml is included in Python 2.X and 3.X.
There is also a good explanation elsewhere on this site for how to use wget & grep to do the same thing:
Spider a Website and Return URLs Only
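As a rough sketch of that approach (assuming GNU or BSD grep with -o, links marked up exactly as in the question, and a placeholder URL), something like this would produce the same "text - url" lines:
wget -qO- 'http://www.example.com/' | grep -oE '<a href="[^"]+">[^<]+</a>' | sed -E 's#<a href="([^"]+)">([^<]+)</a>#\2 - \1#' > links.txt
grep -o pulls each link out onto its own line (even when several links share one line of html), and sed then rearranges it into the requested format.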
Upvotes: 0