Reputation: 7357
I'm trying to open the latest Japanese Wikipedia database dump for reading in Python 3.3.1 on Linux, but I'm getting a Segmentation fault (core dumped) error with this short program:
with open("jawiki-latest-pages-articles.xml") as f:
text = f.read()
The file itself is quite large:
-rw-r--r-- 1 fredrick users 7368183805 May 17 20:19 jawiki-latest-pages-articles.xml
So it seems like there is an upper limit to just how long a string I can store. What's the best way to tackle this situation?
My end goal is to count the most common characters in the file, sort of like a modern version of Jack Halpern's "Most Commonly Used Kanji in Newspapers". :)
Upvotes: 2
Views: 1379
Reputation: 7357
This is the program I eventually used, if anyone was curious.
from collections import Counter

counter = Counter()
progress = 0
with open("jawiki-latest-pages-articles.xml") as f:
    for line in f:
        progress += 1
        counter.update(line)
        if not progress % 10000:
            print("Processing line {0}..., number {1}".format(line[:10], progress))

with open("output.txt", "w+") as output:
    for k, v in counter.items():
        print("{0}\t{1}".format(k, v), file=output)
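Since the stated goal is to find the most common characters, the results can also be written in descending order of frequency with Counter.most_common(), which returns (element, count) pairs sorted by count. A minimal sketch building on the counter above (the filename sorted_output.txt is just an example):
# Write characters sorted by frequency, most common first.
with open("sorted_output.txt", "w") as output:
    for char, count in counter.most_common():
        print("{0}\t{1}".format(char, count), file=output)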
Upvotes: 0
Reputation: 400414
Don't read the whole file at once. Even if your Python distribution is compiled as a 64-bit program (a 32-bit program simply cannot allocate more than 4 GB of virtual memory), and even if you have enough RAM to hold it all, it's still a bad idea to read it all into memory at once.
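As a side note, you can check whether your interpreter is a 64-bit build by inspecting sys.maxsize, which is the check recommended in the Python documentation:
import sys

# True on a 64-bit Python build, False on a 32-bit one.
print(sys.maxsize > 2**32)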
One simple option is to read it a line at a time and process each line:
with open("jawiki-latest-pages-articles.xml") as f:
for line in f:
# Process one line
Alternatively, you can process it in fixed-size chunks:
with open("jawiki-latest-pages-articles.xml") as f:
    while True:
        data = f.read(65536)  # Or any other reasonable-sized chunk
        if not data:
            break
        # Process one chunk of data. Make sure to handle data that overlaps
        # between chunks properly, and make sure to handle EOF properly.
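Applied to the question's goal of counting characters, the overlap concern largely goes away: in text mode, read(n) returns up to n already-decoded characters, so a chunk never ends in the middle of a multi-byte character. A minimal sketch, assuming the dump is UTF-8 (the encoding argument and chunk size are illustrative):
from collections import Counter

counter = Counter()
# Text mode: read(n) yields whole decoded characters, so no chunk
# boundary can split a multi-byte character.
with open("jawiki-latest-pages-articles.xml", encoding="utf-8") as f:
    while True:
        chunk = f.read(65536)
        if not chunk:
            break
        counter.update(chunk)

print(counter.most_common(20))  # 20 most frequent characters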
Upvotes: 11