valpa

Reputation: 367

How to export all pages from MediaWiki into individual page files?

Related to How to export text from all pages of a MediaWiki?, but I want the output to be individual text files named after the page title.

SELECT page_title, page_touched, old_text
FROM revision,page,text
WHERE revision.rev_id=page.page_latest
AND text.old_id=revision.rev_text_id;

works to dump all pages to stdout in one go.

How can I split them and write each page to its own file?

SOLVED

First, dump everything into a single file:

SELECT page_title, page_touched, old_text
FROM revision, page, text
WHERE revision.rev_id = page.page_latest
  AND text.old_id = revision.rev_text_id
  -- skip the File (6), MediaWiki (8) and Help (12) namespaces
  AND page_namespace NOT IN (6, 8, 12)
INTO OUTFILE '/tmp/wikipages.csv'
FIELDS TERMINATED BY '\n'
ESCAPED BY ''
LINES TERMINATED BY '\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n';
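
Note that INTO OUTFILE requires the MySQL FILE privilege and writes the file on the database server itself; the secure_file_priv setting may restrict or disable it. If that is a problem, the same dump can be produced client-side. A minimal sketch using the pymysql package (the connection settings are placeholders, and the pre-1.35 schema from the question is assumed):

import pymysql

SEP = b'\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n'

# Placeholder connection settings; adjust to your wiki's database.
conn = pymysql.connect(host='localhost', user='wikiuser',
                       password='secret', database='wikidb')
with conn.cursor() as cur:
  cur.execute("""
      SELECT page_title, page_touched, old_text
      FROM revision, page, text
      WHERE revision.rev_id = page.page_latest
        AND text.old_id = revision.rev_text_id
        AND page_namespace NOT IN (6, 8, 12)
  """)
  with open('/tmp/wikipages.csv', 'wb') as out:
    for title, touched, text in cur:
      # these are binary columns, so the values arrive as bytes
      out.write(title + b'\n' + touched + b'\n' + text + SEP)
conn.close()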

Then split it into individual files using Python:

import os

os.makedirs('/tmp/wikipages', exist_ok=True)

with open('wikipages.csv', 'r') as f:
  alltxt = f.read().split('\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n')

for row in alltxt:
  lines = row.split('\n')
  if len(lines) < 3:
    # need at least page_title, page_touched and one line of text
    continue
  name = lines[0].replace('/', '-')  # '/' is not allowed in file names
  txt = '\n'.join(lines[2:])         # drop the page_title and page_touched fields
  with open('/tmp/wikipages/' + name + '.txt', 'w') as of:
    of.write(txt)
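
For a large wiki, reading the entire dump into memory can be a problem. A streaming variant of the same split (same dump file and separator assumed) could look like this:

import os

SEP = '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'

os.makedirs('/tmp/wikipages', exist_ok=True)

def flush(block):
  # block holds page_title, page_touched, then the wikitext lines
  if len(block) < 3:
    return
  name = block[0].replace('/', '-')
  with open('/tmp/wikipages/' + name + '.txt', 'w') as out:
    out.write('\n'.join(block[2:]))

block = []
with open('wikipages.csv') as f:
  for line in f:
    line = line.rstrip('\n')
    if line == SEP:
      flush(block)
      block = []
    else:
      block.append(line)
flush(block)  # in case the dump does not end with the separator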

Upvotes: 1

Views: 1320

Answers (2)

clockoon

Reputation: 1

As of MediaWiki 1.35, the multi-content revision (MCR) model has been implemented, so the original dump query won't work correctly. You can use the following query instead:

SELECT page_title, page_touched, old_text
FROM revision, page, text, content, slots
WHERE page.page_latest = revision.rev_id
  AND revision.rev_id = slots.slot_revision_id
  AND slots.slot_content_id = content.content_id
  -- content_address holds text-table addresses like 'tt:<old_id>';
  -- stripping the 'tt:' prefix yields the text.old_id
  AND text.old_id = CONVERT(SUBSTRING(content.content_address, 4), SIGNED)
  AND page_namespace NOT IN (6, 8, 12)
INTO OUTFILE '/var/tmp/wikipages.csv'
FIELDS TERMINATED BY '\n'
ESCAPED BY ''
LINES TERMINATED BY '\n@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@\n';
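
To make the address arithmetic explicit: content.content_address stores text-table addresses of the form tt:<old_id>, and the CONVERT/SUBSTRING pair strips the tt: prefix to recover text.old_id. A hypothetical helper showing the same decoding in Python:

def old_id_from_address(addr):
  # MCR content_address values look like 'tt:12345';
  # the number is the text.old_id of the wikitext blob
  scheme, _, key = addr.partition(':')
  if scheme != 'tt':
    raise ValueError('not a text-table address: ' + addr)
  return int(key)

print(old_id_from_address('tt:12345'))  # -> 12345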

Upvotes: 0

wakalaka

Reputation: 533

If you have some Python knowledge, you can use the mwclient library to achieve this:

  1. Install Python 2.7 via sudo apt-get install python2.7 (see https://askubuntu.com/questions/101591/how-do-i-install-python-2-7-2-on-ubuntu in case of trouble); recent versions of mwclient also run on Python 3
  2. Install mwclient via pip install mwclient
  3. Run the Python script below

    import mwclient

    # ('http', 'you-wiki-domain.com') is a placeholder (scheme, host) pair;
    # path='/' assumes api.php sits at the web root
    wiki = mwclient.Site(('http', 'you-wiki-domain.com'), path='/')
    for page in wiki.allpages():
      name = page.page_title.replace('/', '-') + '.txt'
      with open(name, 'w') as f:
        f.write(page.text())

See the mwclient page https://github.com/mwclient/mwclient for reference.
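
Note that allpages() lists only the main namespace (0) by default, which roughly matches the page_namespace filtering in the SQL approach; other namespaces can be requested explicitly. A sketch with a recent mwclient, where a plain hostname implies HTTPS (the host and path are placeholders):

    import mwclient

    wiki = mwclient.Site('you-wiki-domain.com', path='/w/')
    for ns in ('0', '4'):  # main and Project namespaces, for example
      for page in wiki.allpages(namespace=ns):
        print(page.page_title)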

Upvotes: 1
