Reputation: 835
I am trying to fetch a Wikipedia article with Python's urllib:
import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()
However, instead of the HTML page I get the following response: Error - Wikimedia Foundation:
Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT
Wikipedia seems to block requests that are not from a standard browser.
Anybody know how to work around this?
Upvotes: 40
Views: 27796
Reputation: 2584
Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Parsing the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but is a good choice if you want finer control; see the parse API help.
If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.
As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make https requests.
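For example, a minimal sketch with the requests library (the User-Agent string and contact address are just illustrative placeholders):

import requests

# Fetch the cached HTML rendering of the article from the RESTBase API, over HTTPS
headers = {'User-Agent': 'MyWikiFetcher/0.1 (contact: you@example.com)'}
url = 'https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein'
resp = requests.get(url, headers=headers)
resp.raise_for_status()
html = resp.text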
Upvotes: 1
Reputation: 27766
You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.
Straight from the examples
import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
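If you are on Python 3, where urllib2 was folded into urllib.request, a roughly equivalent sketch would be:

import urllib.request

# Same trick: attach a browser-like User-Agent to the opener
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('https://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read().decode('utf-8')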
Upvotes: 51
Reputation: 20301
requests is awesome!
Here is how you can get the HTML content with requests:
import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text
Done!
Upvotes: 2
Reputation: 31
In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, like in:
'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'
Or, if you want the HTML code, use 'action=render' like in:
'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'
You can also define a section to get just part of the content with something like 'section=3'.
You could then access it using the urllib2 module (as suggested in the chosen answer). However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.
Refer to MediaWiki's FAQ if you need more information.
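As a rough sketch of the above, combining action=raw with a section parameter and the urllib2 opener from the chosen answer (the URL and section number are purely illustrative):

import urllib2

# Raw wikitext of just section 3 of the Main Page
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
wikitext = opener.open('https://en.wikipedia.org/w/index.php?action=raw&title=Main_Page&section=3').read()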
Upvotes: 3
Reputation: 6726
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()
This seems to work for me without changing the user agent; without action=raw it does not work.
Upvotes: 0
Reputation: 6387
Rather than trying to trick Wikipedia, you should consider using their High-Level API.
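For instance, a minimal sketch against the MediaWiki web API at /w/api.php using requests (parameters are illustrative; the response layout below assumes the default JSON format version):

import requests

# Ask the API to parse the article and return its HTML inside a JSON envelope
params = {
    'action': 'parse',
    'page': 'Albert Einstein',
    'format': 'json',
}
data = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
html = data['parse']['text']['*']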
Upvotes: 15
Reputation: 21831
It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be much easier, especially since you directly get the article contents, which removes the need to parse the HTML.
I have used it myself for two projects, and it works very well.
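A rough sketch of what that looks like with mwclient (method names may differ between library versions):

import mwclient

# Connect to English Wikipedia and pull one article's wikitext directly
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()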
Upvotes: 37
Reputation: 20940
The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.
In your program (in this case Python) you should try to send an HTTP request as similar as possible to the one that worked from Firefox. This often includes setting the User-Agent, Referer, and Cookie fields, but there may be others.
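A sketch of that approach with urllib2, where every header value is a placeholder to be replaced with whatever Firebug recorded:

import urllib2

# Replay the browser's request; copy the real header values from Firebug
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; rv:1.9) Gecko/2008 Firefox/3.0',
    'Referer': 'https://en.wikipedia.org/',
    'Cookie': 'name=value',  # paste the cookie string the browser actually sent
}
req = urllib2.Request('https://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes', headers=headers)
page = urllib2.urlopen(req).read()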
Upvotes: 2
Reputation: 59
You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.
Upvotes: 1
Reputation: 38106
Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
Upvotes: 1