Reputation: 1015
I have a website with informational pages about topics (each explaining what the topic is). Each topic has its own webpage, and every webpage is laid out the same way. I want to retrieve this information automatically. I was thinking of using something like wget to grab the info, but I'm new to wget, so I don't know if it will work, nor how I would run it so it visits each page and extracts the information I want.
I hope I've made a little sense here. Like I said, my attempt at the problem is using wget and maybe a Python script? I'm not asking for a script that does it for me, just looking for some direction.
Upvotes: 0
Views: 42
Reputation: 21
Every once in a while I have the same problem. What I usually do is write a small script like this:
url = "www.yoursite.com/topics"
custom_regex = re.compile("insert your a regex here")
req = urllib2.Request(url, headers={"User-Agent": "Magic Browser"})
text = urllib2.urlopen(req).read()
for link in custom_regex.findall(text):
print link
And then use it like this:
python script.py > urls.txt
wget -i urls.txt
The -i option tells wget to download all URLs listed in a file, one URL per line.
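For instance, if the topic pages follow a predictable naming pattern, urls.txt might end up looking something like this (hypothetical URLs):
http://www.yoursite.com/topics/topic-1
http://www.yoursite.com/topics/topic-2
http://www.yoursite.com/topics/topic-3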
Upvotes: 2
Reputation: 5919
To retrieve a web page in Python, rather than using wget, I would recommend using Python's urllib2 - https://docs.python.org/2/howto/urllib2.html
Once you have retrieved the web page, you can parse it using BeautifulSoup - http://www.crummy.com/software/BeautifulSoup/bs4/doc/ - it will parse the HTML for you, so you can go right to the pieces of the page you want.
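For example, here is a minimal sketch combining the two. The URL and the "topic-info" class name are hypothetical placeholders; adjust both to match your site's actual markup:
import urllib2
from bs4 import BeautifulSoup

# Hypothetical URL and CSS class - adjust to your site's markup
url = "http://www.yoursite.com/topics/some-topic"
html = urllib2.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
info = soup.find("div", class_="topic-info")  # the element holding the topic text
if info is not None:
    print info.get_text(strip=True)
Since every page is laid out the same, the same find() call should work on each topic page once you have identified the right tag and class in the page source.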
Upvotes: 1