Reputation: 145
I have a 30 MB .txt file containing a single line of data (a 30-million-digit number).
Unfortunately, every method I've tried (mmap.read(), readline(), allocating 1 GB of RAM, for loops) takes 45+ minutes to completely read the file.
Every method I found on the internet seems to rely on each line being small, so that memory consumption is only as big as the biggest line in the file. Here's the code I've been using:
import time
import mmap

f = open('log.txt', 'a')  # assumed log file; f was opened elsewhere in my script

start = time.clock()
z = open('Number.txt', 'r+')
m = mmap.mmap(z.fileno(), 0)
global a
a = int(m.read())  # this conversion is the slow part
z.close()
end = time.clock()
secs = end - start
print("Number read in", "%s" % secs, "seconds.", file=f)
print("Number read in", "%s" % secs, "seconds.")
f.flush()
del end, start, secs, z, m
Other than splitting the number from one line into several lines (which I'd rather not do), is there a cleaner method that won't require the better part of an hour?
By the way, I don't necessarily have to use text files.
I have: Windows 8.1 64-Bit, 16GB RAM, Python 3.5.1
Upvotes: 6
Views: 2212
Reputation: 177461
The file read is quick (<1s):
with open('number.txt') as f:
    data = f.read()
Converting a 30-million-digit string to an integer, that's slow:
z = int(data)  # still waiting...
If you store the number as raw big- or little-endian binary data, then int.from_bytes(data, 'big') is much quicker.
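A minimal sketch of that round trip, assuming a Number.bin scratch file (the name is just for illustration): pay the slow decimal parse once, dump the raw bytes, and every later run reloads in well under a second.

# One-time conversion: parse the decimal text (slow), then dump raw bytes.
n = int(open('Number.txt').read())       # the slow step, done only once
size = (n.bit_length() + 7) // 8         # bytes needed to hold n
with open('Number.bin', 'wb') as out:    # 'Number.bin' is an assumed name
    out.write(n.to_bytes(size, 'big'))

# Every later run: reload the integer from raw bytes (<1s).
with open('Number.bin', 'rb') as inp:
    n = int.from_bytes(inp.read(), 'big')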
If I did my math right (note: _ means "last line's answer" in Python's interactive interpreter):
>>> import math
>>> math.log(10)/math.log(2) # Number of bits to represent a base 10 digit.
3.3219280948873626
>>> 30000000*_ # Number of bits to represent 30M-digit #.
99657842.84662087
>>> _/8 # Number of bytes to represent 30M-digit #.
12457230.35582761 # Only ~12MB so file will be smaller :^)
>>> import os
>>> data=os.urandom(12457231) # Generate some random bytes
>>> z=int.from_bytes(data,'big') # Convert to integer (<1s)
>>> z.bit_length() # Number of bits in the integer (12457231 * 8).
99657848
>>> math.log10(z) # number of base-10 digits in number.
30000001.50818886
EDIT: FYI, my math wasn't right, but I fixed it. Thanks for 10 upvotes without noticing :^)
Upvotes: 10
Reputation: 145
I used the gmpy2 module to convert the string to a number.
import time
import gmpy2

f = open('log.txt', 'a')  # assumed log file; f was opened elsewhere in my script

start = time.clock()
z = open('Number.txt', 'r+')
data = z.read()
global a
a = gmpy2.mpz(data)  # arbitrary-precision integer via GMP
end = time.clock()
secs = end - start
print("Number read in", "%s" % secs, "seconds.", file=f)
print("Number read in", "%s" % secs, "seconds.")
f.flush()
del end, secs, start, z, data
It worked in 3 seconds; much slower than the binary approach, but at least it gave me an integer value.
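If the number has to be reloaded often, gmpy2 can also round-trip its own binary format via gmpy2.to_binary() and gmpy2.from_binary(), which skips re-parsing the decimal string on every run. A minimal sketch, with Number.gmp as an assumed file name:

import gmpy2

a = gmpy2.mpz(open('Number.txt').read())  # parse the decimal text once (~3s)
with open('Number.gmp', 'wb') as out:     # 'Number.gmp' is an assumed name
    out.write(gmpy2.to_binary(a))         # portable binary dump of the mpz

with open('Number.gmp', 'rb') as inp:
    a = gmpy2.from_binary(inp.read())     # near-instant reload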
Thank you all for your invaluable answers; I'm going to accept this one as soon as possible.
Upvotes: 1
Reputation: 8376
A 30 MB text file should not take very long to read; modern hard drives should manage it in less than a second (not counting access time).
The standard Python file I/O works fine in this case:
with open('my_file', 'r') as handle:
    content = handle.read()
Using this on my laptop yields times much less than a second.
However, converting those 30 MB of digits to an integer is your bottleneck: Python cannot hold such a value in a fixed-size machine type, so int() must build an arbitrary-precision integer, and that decimal-to-binary conversion is quadratic in the number of digits.
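A minimal timing sketch that separates the two phases makes this visible (Number.txt is the asker's file name; the timings printed will vary by machine):

import time

with open('Number.txt') as handle:
    t0 = time.perf_counter()
    content = handle.read()           # fast: plain file read
    t1 = time.perf_counter()

n = int(content)                      # slow: arbitrary-precision conversion
t2 = time.perf_counter()

print('read: %.3fs, int(): %.3fs' % (t1 - t0, t2 - t1))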
You could try the decimal module, though it is mainly designed for decimal floating-point arithmetic.
Besides that, there is of course numpy, which might be faster (and since you probably want to do some work with the number later on, it would make sense to use such a library).
Upvotes: 3