Reputation: 861
I've tried to convert a bz2 Wikipedia dump to text with Wikipedia Extractor (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor). I downloaded the dump with the bz2 extension, then ran this command:
WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles.xml.bz2
This gave me a result that can be seen in the link:
However, the instructions follow up by stating: "In order to combine the whole extracted text into a single file one can issue:"
> find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
> rm -rf extracted
I get the following error:
File not found - '*bz2'
What can I do?
Upvotes: 1
Views: 2967
Reputation: 2180
Please go through this first; it should help:
Error using the 'find' command to generate a collection file on opencv
The commands mentioned on the WikiExtractor page are for Unix/Linux systems and won't work on Windows. The find command you ran on Windows works differently from the one on Unix/Linux.
The extraction part works fine on both Windows and Linux environments as long as you run it with the python prefix:
python WikiExtractor.py -cb 250K -o extracted your_bz2_file
You would then see an extracted folder created in the same directory as your script.
After that, the find command is supposed to work like this, but only on Linux:
find extracted -name '*bz2' -exec bzip2 -c {} \; > text.xml
It finds everything in the extracted folder that matches *bz2, executes the bzip2 command on each of those files, and puts the result in the text.xml file.
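What that pipeline is meant to achieve (decompressing every extracted part into one file) can also be sketched cross-platform with Python's standard library. This is a minimal sketch, not WikiExtractor's own tooling; the extracted folder name and text.xml output are taken from the commands above:

```python
import bz2
import glob
import os

# Mimic `find extracted -name '*bz2'`: match every bz2 part under the folder.
parts = glob.glob(os.path.join("extracted", "**", "*bz2"), recursive=True)

# Decompress each part (what bzcat does) and concatenate the text
# into a single output file.
with open("text.xml", "w", encoding="utf8") as out:
    for path in sorted(parts):
        with bz2.open(path, "rt", encoding="utf8") as f:
            out.write(f.read())
```

Because it uses only the standard library, the same script runs unchanged on Windows and Linux.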
Also, if you run bzip2 --help (bzip2 being the command the find invocation above executes), you would see that it won't work on Windows; on Linux you get the following output.
gaurishankarbadola@ubuntu:~$ bzip2 -help
bzip2, a block-sorting file compressor. Version 1.0.6, 6-Sept-2010.
usage: bzip2 [flags and input files in any order]
-h --help print this message
-d --decompress force decompression
-z --compress force compression
-k --keep keep (don't delete) input files
-f --force overwrite existing output files
-t --test test compressed file integrity
-c --stdout output to standard out
-q --quiet suppress noncritical error messages
-v --verbose be verbose (a 2nd -v gives more)
-L --license display software version & license
-V --version display software version & license
-s --small use less memory (at most 2500k)
-1 .. -9 set block size to 100k .. 900k
--fast alias for -1
--best alias for -9
If invoked as `bzip2', default action is to compress.
as `bunzip2', default action is to decompress.
as `bzcat', default action is to decompress to stdout.
If no file names are given, bzip2 compresses or decompresses
from standard input to standard output. You can combine
short flags, so `-v -4' means the same as -v4 or -4v, &c.
As mentioned above, bzip2's default action is to compress, so use bzcat for decompression. The modified command, which again works only on Linux, looks like this:
find extracted -name '*bz2' -exec bzcat -c {} \; > text.xml
It works on my Ubuntu system.
EDIT :
For Windows :
BEFORE YOU TRY ANYTHING, PLEASE GO THROUGH THE INSTRUCTIONS FIRST
1. Get WikiExtractor.py and itwiki-latest-pages-articles1.xml-p1p277091.bz2 (in my case, since it is a small file I could find) into the same directory.
2. Open a command prompt in that directory and run the following command to extract all the files:
python WikiExtractor.py -cb 250K -o extracted itwiki-latest-pages-articles1.xml-p1p277091.bz2
It will take time depending on the file size, but the directory will then look like this.
CAUTION : If you already have the extracted folder, move it into the current directory so that it matches the image above and you don't have to do the extraction again.
3. Create a bz2_Extractor.py file with the following code:

import argparse
import bz2
import logging
from datetime import datetime
from os import listdir
from os.path import isfile, join, isdir

FORMAT = '%(levelname)s: %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger()
logger.setLevel(logging.INFO)


def get_all_files_recursively(root):
    # Collect files in this directory, then recurse into subdirectories.
    files = [join(root, f) for f in listdir(root) if isfile(join(root, f))]
    dirs = [d for d in listdir(root) if isdir(join(root, d))]
    for d in dirs:
        files.extend(get_all_files_recursively(join(root, d)))
    return files


def bzip_decompress(list_of_files, output_path):
    start_time = datetime.now()
    with open(output_path, 'w+', encoding="utf8") as output_file:
        for file in list_of_files:
            with bz2.open(file, 'rt', encoding="utf8") as bz2_file:
                logger.info(f"Reading/Writing file ---> {file}")
                output_file.writelines(bz2_file.read())
                output_file.write('\n')
    stop_time = datetime.now()
    print(f"Total time taken to write out {len(list_of_files)} files = "
          f"{(stop_time - start_time).total_seconds()}")


def main():
    parser = argparse.ArgumentParser(description="Input fields")
    parser.add_argument("-r", required=True)
    parser.add_argument("-n", required=False, type=int)
    parser.add_argument("-o", required=True)
    args = parser.parse_args()
    all_files = get_all_files_recursively(args.r)
    # Slicing with None ([:None]) keeps every file, so -n stays optional.
    bzip_decompress(all_files[:args.n], args.o)


if __name__ == "__main__":
    main()
Please read what each input does in the command:
python bz2_Extractor.py -r extracted -o output.txt -n 10
-r : The root directory your bz2 files are in.
-o : The output file name.
-n : The number of files to write out. [If not provided, it writes out all the files inside the root directory.]
CAUTION : I can see that your file is gigabytes in size and contains more than half a million articles. If you try to put all of that into a single file with the above command, I'm not sure whether your system can survive it, and even if it does, the output file (extracted from a 2.8 GB archive) would be so large that I don't think Windows could open it directly.
So my suggestion would be to process 10000 files at a time.
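That batching idea could be sketched like this. It is a hedged, self-contained variant of the script above (not part of WikiExtractor); the batch size of 10000 and the output_0.txt naming scheme are assumptions for illustration:

```python
import bz2
import os

BATCH = 10000  # illustrative batch size

def write_batch(files, out_path):
    # Decompress each bz2 file in this batch into one output file.
    with open(out_path, "w", encoding="utf8") as out:
        for path in files:
            with bz2.open(path, "rt", encoding="utf8") as f:
                out.write(f.read())
                out.write("\n")

# Collect all files under 'extracted', then process them in batches,
# writing output_0.txt, output_1.txt, ...
all_files = []
for root, _dirs, names in os.walk("extracted"):
    all_files.extend(os.path.join(root, n) for n in names)

for i in range(0, len(all_files), BATCH):
    write_batch(all_files[i:i + BATCH], f"output_{i // BATCH}.txt")
```

Each batch lands in its own file, so no single output grows beyond what Windows tools can open.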
Let me know if this works for you.
PS : For the above command, the output looks like this.
Upvotes: 3