Fredrick Brennan

Reputation: 7357

Reading a 6.9GB file causes a segmentation fault

I'm trying to open the latest Japanese Wikipedia database dump for reading in Python 3.3.1 on Linux, but I'm getting a Segmentation fault (core dumped) error with this short program:

with open("jawiki-latest-pages-articles.xml") as f:
    text = f.read()

The file itself is quite large:

-rw-r--r-- 1 fredrick users 7368183805 May 17 20:19 jawiki-latest-pages-articles.xml

So it seems like there is an upper limit to just how long a string I can store. What's the best way to tackle this situation?

My end goal is to count the most common characters in the file, sort of like a modern version of Jack Halpern's "Most Commonly Used Kanji in Newspapers". :)

Upvotes: 2

Views: 1379

Answers (2)

Fredrick Brennan

Reputation: 7357

This is the program I eventually used, if anyone was curious.

from collections import Counter

# Tally how often each character appears, reading the dump one line at a time
counter = Counter()

progress = 0
with open("jawiki-latest-pages-articles.xml") as f:
    for line in f:
        progress += 1
        counter.update(line)  # counts every character on the line
        if not progress % 10000:
            print("Processing line {0}..., number {1}".format(line[:10], progress))

with open("output.txt", "w+") as output:
    for k, v in counter.items():
        print("{0}\t{1}".format(k, v), file=output)
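A side note for anyone reusing this: Counter.most_common() returns the (character, count) pairs already sorted by descending frequency, so a small variant of the output step above (same counter, same file name) writes the tallies in order:

with open("output.txt", "w+") as output:
    for k, v in counter.most_common():
        print("{0}\t{1}".format(k, v), file=output)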

Upvotes: 0

Adam Rosenfield

Reputation: 400414

Don't read the whole file at once. Even if your Python distribution is compiled as a 64-bit program (a 32-bit process simply cannot allocate more than 4 GB of virtual memory), and even if you have enough RAM to hold it all, reading it all into memory at once is still a bad idea.

One simple option is to read it a line at a time and process each line:

with open("jawiki-latest-pages-articles.xml") as f:
    for line in f:
        # Process one line

Alternatively, you can process it in fixed-size chunks:

with open("jawiki-latest-pages-articles.xml") as f:
    while True:
        data = f.read(65536)  # Or any other reasonable-sized chunk
        if not data:
            break
        # Process one chunk of data.  Make sure to handle data which overlaps
        # between chunks properly, and make sure to handle EOF properly
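To make the "overlapping data" caveat concrete for this particular file: a multi-byte UTF-8 character can be split across two chunks. Here is a minimal sketch (my addition, assuming the dump is UTF-8 encoded, as Wikipedia dumps generally are) that uses an incremental decoder so split sequences are buffered rather than mangled:

import codecs
from collections import Counter

counter = Counter()
# The incremental decoder holds back incomplete UTF-8 byte sequences
# at chunk boundaries instead of raising a decoding error
decoder = codecs.getincrementaldecoder("utf-8")()

with open("jawiki-latest-pages-articles.xml", "rb") as f:
    while True:
        data = f.read(65536)
        if not data:
            break
        counter.update(decoder.decode(data))
    # Flush anything still buffered at EOF
    counter.update(decoder.decode(b"", final=True))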

Upvotes: 11
