Mine

Reputation: 861

Wikipedia Extractor as a parser for Wikipedia Data Dump File

I've been trying to convert a bz2 Wikipedia dump to text with Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded the Wikipedia dump with the bz2 extension, then ran this command:

WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2

This gave me the result shown in the screenshot below:

[screenshot: contents of the extracted folder]

Following that, the instructions state: "In order to combine the whole extracted text into a single file one can issue:"

> find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
> rm -rf extracted

I get the following error:

File not found - '*bz2'

What can I do?

Upvotes: 1

Views: 2967

Answers (1)

gsb22

Reputation: 2180

Please go through this; it should help:

Error using the 'find' command to generate a collection file on opencv

The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows.

The find command you ran on Windows works differently from the one on Unix/Linux: it searches for a text string inside files rather than locating files by name.
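For contrast, here is a minimal sketch of how the Unix/Linux find locates files by name (the demo directory and file names are hypothetical, purely for illustration):

```shell
# Build a throwaway demo tree
mkdir -p demo/sub
touch demo/a.bz2 demo/sub/b.bz2 demo/c.txt

# Unix/Linux find matches files by name pattern, recursively;
# this lists demo/a.bz2 and demo/sub/b.bz2 but not demo/c.txt
find demo -name '*bz2'
```
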

The extraction step works fine on both Windows and Linux, as long as you run it with the python prefix:

python WikiExtractor.py -cb 250K -o extracted your_bz2_file

You will see an extracted folder created in the same directory as your script.

After that, the find command is supposed to work like this, but only on Linux:

find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml

It finds everything in the extracted folder whose name matches bz2, executes the bzip2 command on each of those files, and redirects the result into the text.xml file.

Also, if you run the bzip2 -help command (bzip2 being the program that the find command above invokes), you'll see that it isn't available on Windows, while on Linux you get the following output:

gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor.  Version 1.0.6, 6-Sept-2010.

   usage: bzip2 [flags and input files in any order]

   -h --help           print this message
   -d --decompress     force decompression
   -z --compress       force compression
   -k --keep           keep (don't delete) input files
   -f --force          overwrite existing output files
   -t --test           test compressed file integrity
   -c --stdout         output to standard out
   -q --quiet          suppress noncritical error messages
   -v --verbose        be verbose (a 2nd -v gives more)
   -L --license        display software version & license
   -V --version        display software version & license
   -s --small          use less memory (at most 2500k)
   -1 .. -9            set block size to 100k .. 900k
   --fast              alias for -1
   --best              alias for -9

   If invoked as `bzip2', default action is to compress.
              as `bunzip2',  default action is to decompress.
              as `bzcat', default action is to decompress to stdout.

   If no file names are given, bzip2 compresses or decompresses
   from standard input to standard output.  You can combine
   short flags, so `-v -4' means the same as -v4 or -4v, &c.

As the help text above mentions, bzip2's default action is to compress, so use bzcat for decompression.

The modified command (again, Linux only) looks like this:

find extracted -name '*bz2' -exec bzcat -c {} \; > text.xml

It works on my ubuntu system.
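The decompression that bzcat performs can also be sketched with Python's standard-library bz2 module (a minimal round-trip illustration, not part of WikiExtractor):

```python
import bz2

# bzcat reads a .bz2 stream and emits the decompressed bytes;
# compress/decompress here shows that round trip on a small payload.
original = b"some extracted wiki text"
compressed = bz2.compress(original)
print(bz2.decompress(compressed) == original)  # → True
```
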

EDIT:

For Windows:

BEFORE YOU TRY ANYTHING, PLEASE GO THROUGH THE INSTRUCTIONS FIRST

  1. Create a separate folder and put the files in it: WikiExtractor.py and itwiki-latest-pages-articles1.xml-p1p277091.bz2 (in my case, since it is a small file I could find).

  2. Open a command prompt in the current directory and run the following command to extract all the files.

python WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles1.xml-p1p277091.bz2

It will take time depending on the file size, but afterwards the directory will look like this:

[screenshot: directory layout after extraction]

CAUTION: If you already have the extracted folder, move it into the current directory so that it matches the image above and you don't have to run the extraction again.

  3. Copy-paste the code below and save it in a bz2_Extractor.py file.

import argparse
import bz2
import logging

from datetime import datetime
from os import listdir
from os.path import isfile, join, isdir

FORMAT = '%(levelname)s: %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def get_all_files_recursively(root):
    files = [join(root, f) for f in listdir(root) if isfile(join(root, f))]
    dirs = [d for d in listdir(root) if isdir(join(root, d))]
    for d in dirs:
        files_in_d = get_all_files_recursively(join(root, d))
        if files_in_d:
            for f in files_in_d:
                files.append(join(f))
    return files


def bzip_decompress(list_of_files, output_file):
    start_time = datetime.now()
    with open(f'{output_file}', 'w+', encoding="utf8") as output_file:
        for file in list_of_files:
            with bz2.open(file, 'rt', encoding="utf8") as bz2_file:
                logger.info(f"Reading/Writing file ---> {file}")
                output_file.writelines(bz2_file.read())
                output_file.write('\n')
    stop_time = datetime.now()
    print(f"Total time taken to write out {len(list_of_files)} files = {(stop_time - start_time).total_seconds()}")


def main():
    parser = argparse.ArgumentParser(description="Input fields")
    parser.add_argument("-r", required=True)
    parser.add_argument("-n", required=False)
    parser.add_argument("-o", required=True)
    args = parser.parse_args()

    all_files = get_all_files_recursively(args.r)
    # If -n is not given, write out all files; otherwise only the first n.
    n = int(args.n) if args.n is not None else len(all_files)
    bzip_decompress(all_files[:n], args.o)


if __name__ == "__main__":
    main()

  4. Now the current directory will look like this:

[screenshot: directory with bz2_Extractor.py added]

  5. Now open a cmd in the current directory and run the following command.

Please read what each input does in the command.


python bz2_Extractor.py -r extracted -o output.txt -n 10

-r : The root directory that contains your bz2 files.

-o : The output file name.

-n : The number of files to write out. [If not provided, it writes out all the files inside the root directory.]

CAUTION: I can see that your file is in the gigabytes and contains more than half a million articles. If you try to put all of that into a single file with the command above, I'm not sure your system can handle it, and even if it does, the output extracted from a 2.8GB archive would be so large that I don't think Windows could open it directly.

So my suggestion would be to process 10000 files at a time.
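That batching suggestion can be sketched like this (the chunks helper and file names are hypothetical, not part of the script above): slice the file list into groups of 10000 and call bzip_decompress once per group with a distinct output name.

```python
# Hypothetical helper: yield consecutive slices of `seq` of length `size`
# (the last slice may be shorter).
def chunks(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

# Illustration with fake paths instead of real bz2 files:
# 25000 files split into batches of 10000, 10000, and 5000.
all_files = [f"extracted/wiki_{i}.bz2" for i in range(25000)]
for batch_no, batch in enumerate(chunks(all_files, 10000)):
    # In the real script you would call something like:
    # bzip_decompress(batch, f"output_{batch_no}.txt")
    print(batch_no, len(batch))
```
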

Let me know if this works for you.

PS: For the command above, the output looks like this:

[screenshots of the resulting output]

Upvotes: 3
