shantanuo

Reputation: 32346

Extract text from wikisource XML dump file

The following is part of a Marathi Wikisource dump file.

I am trying to extract the content of the text tag for every page whose title matches the string "My book". Is there an easy way to achieve this? Wikisource is a popular data source, so I guess there must be scripts or modules for this.

  <page>
    <title>My book 1</title>
    <ns>0</ns>
    <id>413</id>
    <revision>
      <id>39062</id>
      <parentid>1660</parentid>
      <timestamp>2019-01-21T10:43:05Z</timestamp>
      <contributor>
        <username>Taiven2240</username>
        <id>1373</id>
      </contributor>
      <minor />
      <comment>मराठीकरण</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text bytes="215367" xml:space="preserve">{{some Info
}}
&lt;poem&gt;
[[वर्ग:अध्यात्मिक]]
[[वर्ग:तपासणी करायचे साहित्य‎]]</text>
      <sha1>kkx0i4d2tm0zehb5wumrgs60lhric2v</sha1>
    </revision>
  </page>

Also, what does the attribute bytes="215367" mean?

I downloaded this file from:

https://dumps.wikimedia.org/mrwikisource/20210601/mrwikisource-20210601-pages-meta-current.xml.bz2

Answers (1)

LMC

Reputation: 12712

A simple, albeit inefficient, approach (the file is 300MB uncompressed) is to use xmllint from the bash command line. It is easy to install on Windows with Cygwin and is usually available by default on Linux (and on macOS, I guess). The script below searches the title tag for 2 different strings in a single pass and prints the matching titles.

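#!/bin/bash

# Substrings to look for in each page's <title>
title1=":Ansumang"
title2=":Marathi"

# Feed commands to the xmllint interactive shell through a here-document.
# setns binds the prefix x to the MediaWiki export namespace used by the dump.
time xmllint --shell mrwikisource-20210601-pages-meta-current.xml <<EOF
setns x=http://www.mediawiki.org/xml/export-0.10/
cat //x:page[x:title[contains(text(),"$title1")] | x:title[contains(text(),"$title2")]]/x:title/text()
EOF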

Result:

/ > setns x=http://www.mediawiki.org/xml/export-0.10/
/ > cat //x:page[x:title[contains(text(),":Ansumang")] | x:title[contains(text(),":Marathi")]]/x:title/text()
 -------
सदस्य:Ansumang
 -------
वर्ग:Marathi
 -------
सदस्य चर्चा:Marathipremi101
 -------
चित्र:MarathiTypingCert-1.png
 -------
विकिस्रोत:Marathi Typing Test
 -------
विकिस्रोत:Marathi Font Typing Test
 -------
विकिस्रोत:Marathi font typing test
 -------
विकिस्रोत:Marathi typing speed test
 -------
सदस्य चर्चा:MarathiBot
/ >
real    0m3.223s
user    0m2.904s
sys     0m0.312s

To get the content of the revision/text tag instead of the title, use:

cat //x:page[x:title[contains(text(),"$title1")] | x:title[contains(text(),"$title2")]]/x:revision/x:text/text()

The equivalent one-liner for the titles is:

(t1=':Ansumang'; t2=':Marathi' ; echo 'setns x=http://www.mediawiki.org/xml/export-0.10/'; echo "cat //x:page[x:title[contains(text(),'$t1')] | x:title[contains(text(),'$t2')]]/x:title/text()") | xmllint --shell mrwikisource-20210601-pages-meta-current.xml ; echo

One-liner to get both the title and revision/text:

(t1=':Ansumang'; t2=':Marathi' ; echo 'setns x=http://www.mediawiki.org/xml/export-0.10/'; echo "cat //x:page[x:title[contains(text(),'$t1')] | x:title[contains(text(),'$t2')]]/descendant::*[self::x:title or self::x:text]/text()") | xmllint --shell mrwikisource-20210601-pages-meta-current.xml ; echo
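
If loading the whole file into xmllint is too slow or too memory-hungry, a streaming parser is one possible alternative. The following is only a minimal sketch using Python's standard library, not something taken from the xmllint approach above: the function name iter_matching_pages and the helper local are made up for illustration, and the inline sample is a tiny hand-written stand-in for the real dump. It compares local tag names so it does not depend on the exact export namespace version, and it assumes one revision per page (as in a pages-meta-current dump); with several revisions, the last text seen would be returned.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_matching_pages(source, needles):
    """Yield (title, text) for each <page> whose title contains any needle.

    `source` is a filename or a binary file object holding the dump XML.
    (Hypothetical helper, named here for illustration only.)
    """
    title = None
    text = None
    # "end" events fire once an element is complete, so elem.text is final.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None and any(n in title for n in needles):
                yield title, text or ""
            title = text = None
            # Drop the finished page's children to keep memory use low.
            elem.clear()


# Tiny hand-written sample standing in for the real dump file.
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>My book 1</title><revision><text>first</text></revision></page>
  <page><title>Other</title><revision><text>second</text></revision></page>
</mediawiki>"""

for page_title, page_text in iter_matching_pages(io.BytesIO(sample), ["My book"]):
    print(page_title, "->", page_text)
```

For the real dump, the same function should accept the object returned by bz2.open("mrwikisource-20210601-pages-meta-current.xml.bz2", "rb") from the standard library, so the file would not need to be decompressed first. The XML parser decodes entities, so &lt;poem&gt; in the dump arrives as <poem> in the returned text.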
