Asim

Reputation: 81

Generating plain text from a Wikipedia database dump

I found a Python script (here: Wikipedia Extractor) that can generate plain text from an (English) Wikipedia database dump. When I run this command (as stated on the script's page):

$ python enwiki-latest-pages-articles.xml WikiExtractor.py -b 500K -o extracted

I get this error:

  File "enwiki-latest-pages-articles.xml", line 1
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
    ^
SyntaxError: invalid syntax

I'm executing the script using Python 2.7.6 & Cygwin on Windows 7.

I hope someone who has already used this script, or who has experience with Python, can help me solve this error.

Thanks in advance!

Upvotes: 6

Views: 9994

Answers (1)

alecxe

Reputation: 473873

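The first argument to python should be the script name. In your command the XML dump comes first, so Python tries to run it as a Python script and fails with a SyntaxError on its first line.

You probably need to swap the XML and .py file names: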

$ python WikiExtractor.py enwiki-latest-pages-articles.xml -b 500K -o extracted
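
To illustrate why the order matters, here is a minimal sketch of how the interpreter splits what follows `python` on the command line. The helper `split_python_args` is a hypothetical name made up for this example, and it deliberately ignores interpreter options such as `-m` or `-c`; it is not part of Python or of WikiExtractor.

```python
def split_python_args(args):
    """Simplified model of `python <args...>` (ignores options like -m or -c)."""
    script = args[0]        # the first argument is treated as the script to run
    script_args = args[1:]  # everything after it is passed to that script
    return script, script_args


# The command from the question: the XML dump is in the script position.
wrong = ["enwiki-latest-pages-articles.xml", "WikiExtractor.py",
         "-b", "500K", "-o", "extracted"]

# The corrected command: the script comes first, the dump is its argument.
right = ["WikiExtractor.py", "enwiki-latest-pages-articles.xml",
         "-b", "500K", "-o", "extracted"]

wrong_script, _ = split_python_args(wrong)
right_script, right_rest = split_python_args(right)

print("wrong order runs: " + wrong_script)
print("right order runs: " + right_script)
```

With the original order, the file Python attempts to execute is the XML dump, which is why the error points at line 1 of enwiki-latest-pages-articles.xml rather than at anything in WikiExtractor.py.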

Upvotes: 17
