Reputation: 835
I am trying to fetch a Wikipedia article with Python's urllib:
import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()
However, instead of the HTML page I get the following response: Error - Wikimedia Foundation:
Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT
Wikipedia seems to block requests that are not from a standard browser.
Anybody know how to work around this?
Upvotes: 40
Views: 27796
Reputation: 2584
Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Parsing the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but is a good choice if you want finer control; see the parse API help.
If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.
As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
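For example, a minimal sketch with the requests library (the User-Agent string and contact address are just illustrative placeholders):

import requests

# Fetch the cached HTML rendering of the article from the RESTBase API, over HTTPS
headers = {'User-Agent': 'MyWikiFetcher/0.1 (contact: you@example.com)'}
url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
resp = requests.get(url, headers=headers)
resp.raise_for_status()
html = resp.text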
Upvotes: 1
Reputation: 27766
You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.
Straight from the examples
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
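If you are on Python 3, where urllib2 was folded into urllib.request, a roughly equivalent sketch would be:

import urllib.request

# Same trick: attach a browser-like User-Agent to the opener
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('https://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read().decode('utf-8')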
Upvotes: 51
Reputation: 20301
requests is awesome!
Here is how you can get the HTML content with requests:
import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text
Done!
Upvotes: 2
Reputation: 31
In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, like in:
'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'
Or, if you want the HTML code, use 'action=render' like in:
'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'
You can also define a section to get just part of the content with something like 'section=3'.
You could then access it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.
Refer to MediaWiki's FAQ if you need more information.
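As a rough sketch of the above, combining action=raw with a section parameter and the urllib2 opener from the chosen answer (the URL and section number are purely illustrative):

import urllib2

# Raw wikitext of just section 3 of the Main Page
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
wikitext = opener.open('https://en.wikipedia.org/w/index.php?action=raw&title=Main_Page&section=3').read()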
Upvotes: 3
Reputation: 6726
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()
This seems to work for me without changing the user agent; without action=raw it does not work.
Upvotes: 0
Reputation: 6387
Rather than trying to trick Wikipedia, you should consider using their High-Level API.
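For instance, a minimal sketch against the MediaWiki web API at /w/api.php using requests (parameters are illustrative; the response layout below assumes the default JSON format version):

import requests

# Ask the API to parse the article and return its HTML inside a JSON envelope
params = {
    'action': 'parse',
    'page': 'Albert Einstein',
    'format': 'json',
}
data = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
html = data['parse']['text']['*']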
Upvotes: 15
Reputation: 21831
It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be much easier, especially since you directly get the article contents, which removes the need to parse the HTML.
I have used it myself for two projects, and it works very well.
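A rough sketch of what that looks like with mwclient (method names may differ between library versions):

import mwclient

# Connect to English Wikipedia and pull one article's wikitext directly
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()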
Upvotes: 37
Reputation: 20940
The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.
In your program (in this case Python) you should try to send an HTTP request as similar as possible to the one that worked from Firefox. This often includes setting the User-Agent, Referer, and Cookie fields, but there may be others.
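A sketch of that approach with urllib2, where every header value is a placeholder to be replaced with whatever Firebug recorded:

import urllib2

# Replay the browser's request; copy the real header values from Firebug
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; rv:1.9) Gecko/2008 Firefox/3.0',
    'Referer': 'https://en.wikipedia.org/',
    'Cookie': 'name=value',  # paste the cookie string the browser actually sent
}
req = urllib2.Request('https://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes', headers=headers)
page = urllib2.urlopen(req).read()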
Upvotes: 2
Reputation: 59
You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.
Upvotes: 1
Reputation: 38106
Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
Upvotes: 1