Fredrick Brennan

Reputation: 7357

Reading a 6.9GB file causes a segmentation fault

I'm trying to open the latest Japanese Wikipedia database dump for reading in Python 3.3.1 on Linux, but I'm getting a Segmentation fault (core dumped) error with this short program:

with open("jawiki-latest-pages-articles.xml") as f:
    text = f.read()

The file itself is quite large:

-rw-r--r-- 1 fredrick users 7368183805 May 17 20:19 jawiki-latest-pages-articles.xml

So it seems like there is an upper limit to just how long a string I can store. What's the best way to tackle this situation?

My end goal is to count the most common characters in the file, sort of like a modern version of Jack Halpern's "Most Commonly Used Kanji in Newspapers". :)

Upvotes: 2

Views: 1379

Answers (2)

Fredrick Brennan

Reputation: 7357

This is the program I eventually used, if anyone was curious.

from collections import Counter

# Tally how often each character appears, reading the dump one line at a time
counter = Counter()

progress = 0
with open("jawiki-latest-pages-articles.xml") as f:
    for line in f:
        progress += 1
        counter.update(line)  # counts every character on the line
        if not progress % 10000:
            print("Processing line {0}..., number {1}".format(line[:10], progress))

with open("output.txt", "w+") as output:
    for k, v in counter.items():
        print("{0}\t{1}".format(k, v), file=output)
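A side note for anyone reusing this: Counter.most_common() returns the (character, count) pairs already sorted by descending frequency, so a small variant of the output step above (same counter, same file name) writes the tallies in order:

with open("output.txt", "w+") as output:
    for k, v in counter.most_common():
        print("{0}\t{1}".format(k, v), file=output)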

Upvotes: 0

Adam Rosenfield

Reputation: 400414

Don't read the whole file at once. Even if your Python distribution is compiled as a 64-bit program (a 32-bit process simply cannot allocate more than 4 GB of virtual memory), and even if you have enough RAM to hold it all, reading it all into memory at once is still a bad idea.

One simple option is to read it a line at a time and process each line:

with open("jawiki-latest-pages-articles.xml") as f:
    for line in f:
        # Process one line

Alternatively, you can process it in fixed-size chunks:

with open("jawiki-latest-pages-articles.xml") as f:
    while True:
        data = f.read(65536)  # Or any other reasonable-sized chunk
        if not data:
            break
        # Process one chunk of data.  Make sure to handle data which overlaps
        # between chunks properly, and make sure to handle EOF properly
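To make the "overlapping data" caveat concrete for this particular file: a multi-byte UTF-8 character can be split across two chunks. Here is a minimal sketch (my addition, assuming the dump is UTF-8 encoded, as Wikipedia dumps generally are) that uses an incremental decoder so split sequences are buffered rather than mangled:

import codecs
from collections import Counter

counter = Counter()
# The incremental decoder holds back incomplete UTF-8 byte sequences
# at chunk boundaries instead of raising a decoding error
decoder = codecs.getincrementaldecoder("utf-8")()

with open("jawiki-latest-pages-articles.xml", "rb") as f:
    while True:
        data = f.read(65536)
        if not data:
            break
        counter.update(decoder.decode(data))
    # Flush anything still buffered at EOF
    counter.update(decoder.decode(b"", final=True))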

Upvotes: 11
