l00kitsjake

Reputation: 1015

Trying to save information on different webpages

I have a website with information about topics (explaining what each one is). Each topic has its own webpage, and every page is laid out the same way. I want to retrieve this information automatically. I was thinking of using something like wget to grab the info, but I'm new to wget, so I don't know whether it will work or how I would run it to go to each page and get the information I want.

I hope I've made a little sense here. Like I said, my attempt at the problem is using wget and maybe a Python script? I'm not asking for a script on how to do it, just looking for some direction.

Upvotes: 0

Views: 42

Answers (2)

Rafael Carrascosa

Reputation: 21

Every once in a while I have the same problem; what I usually do is write a small script like this:

url = "www.yoursite.com/topics"
custom_regex = re.compile("insert your a regex here")
req = urllib2.Request(url, headers={"User-Agent": "Magic Browser"})
text = urllib2.urlopen(req).read()
for link in custom_regex.findall(text):
    print link

And then use it like this:

python script.py > urls.txt
wget -i urls.txt

The -i option tells wget to download all urls listed in a file, one url per line.

Upvotes: 2

AMADANON Inc.

Reputation: 5919

To retrieve a web page in Python, rather than using wget, I would recommend using Python's urllib2 - https://docs.python.org/2/howto/urllib2.html

Once you have retrieved the web page, you can parse it using BeautifulSoup - http://www.crummy.com/software/BeautifulSoup/bs4/doc/ - it will parse the HTML for you, and you can go straight to the pieces of the page you want.
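Putting those two pieces together, a minimal sketch might look like the following. The URL and the "topic-body" class name are placeholders - inspect your own pages to find the real ones - and BeautifulSoup 4 has to be installed separately (e.g. pip install beautifulsoup4):

import urllib2
from bs4 import BeautifulSoup

# Placeholder URL - substitute one of your topic pages
url = "http://www.yoursite.com/topics/some-topic"

# Fetch the page; a User-Agent header helps with servers that reject the default one
req = urllib2.Request(url, headers={"User-Agent": "Magic Browser"})
html = urllib2.urlopen(req).read()

# Parse the HTML and pull out the element holding the topic text
# ("topic-body" is a hypothetical class name)
soup = BeautifulSoup(html)
info = soup.find("div", class_="topic-body")
if info is not None:
    print info.get_text()

Run that over a list of topic URLs (for example the urls.txt file from the other answer) and save each result however you like.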

Upvotes: 1
