Reputation: 305
Hello, I want to randomize the lines of a 153-million-line text file, but the way I'm currently doing it makes me run out of memory:
import random

with open(inputfile, 'r') as source:
    data = [(random.random(), line) for line in source]
data.sort()
with open(outputfile, 'w') as target:
    for _, line in data:
        target.write(line)
Upvotes: 2
Views: 154
Reputation: 140168
I've already posted an answer, but it's indeed sub-optimal because it requires creating another file.
Here's a simpler solution:
We have to use binary mode; otherwise we could run into issues with automatic end-of-line conversions.
import random

# Record the byte offset at which each line starts.
current_offset = 0
offsets = []
with open("input.txt", "rb") as f:
    for line in f:
        offsets.append(current_offset)
        current_offset += len(line)

random.shuffle(offsets)

# Seek to each recorded line start, in shuffled order, and read that line.
with open("input.txt", "rb") as f:
    for offset in offsets:
        f.seek(offset)
        print(f.readline().decode().rstrip())  # or write to another file
For 153 million lines you still need roughly 1 to 1.5 gigabytes of RAM just for the list of offsets, plus the Python int objects themselves (Python 3 integers are arbitrary-precision objects; you could store the offsets in a numpy array instead to reduce memory). If this is acceptable, this is a very simple solution.
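As an illustration of that numpy variant (a sketch only, not tested on the real file; the filename is a placeholder), each offset is stored as a fixed 8-byte int64 instead of a full Python int object, at the cost of an extra pass to count the lines first:
import numpy as np

# First pass: count the lines so the offsets array can be preallocated.
with open("input.txt", "rb") as f:
    n_lines = sum(1 for _ in f)

# Second pass: record each line's starting byte offset (8 bytes per entry).
offsets = np.zeros(n_lines, dtype=np.int64)
current_offset = 0
with open("input.txt", "rb") as f:
    for i, line in enumerate(f):
        offsets[i] = current_offset
        current_offset += len(line)

np.random.shuffle(offsets)  # in-place shuffle of the array
# Then seek and read exactly as above.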
Upvotes: 1
Reputation: 140168
How I would do it without reading all the data into memory, and without too much complexity:
First, compute the maximum line length in your file (using binary mode):
with open(inputfile, 'rb') as source:
    max_line_len = max(len(line) for line in source)
Then write another file to disk with padding, so that each line has exactly the same size (you need more than twice the disk space, but since you don't have the memory...). Count the lines at the same time:
with open(inputfile, 'rb') as source, open(outputfile, 'wb') as dest:
    for count, line in enumerate(source):
        dest.write(line + b"*" * (max_line_len - len(line)))  # write the line, padded to max_line_len
You just created a bigger file, but now the lines have exactly the same length. Well, we padded after the linefeed, which will be useful later. A sample output would be (if max len = 20 for instance):
the first line
****the second line
***another line
******
(I'm not sure of the exact number of stars added, but you get the idea; note that the padding character doesn't matter as long as it isn't \n.)
This means that you can seek to the start of any line with a simple multiplication by max_line_len (like a fixed-size-record file, or a database).
Now you can generate the shuffled list of line indexes:
import random

indexes = list(range(count + 1))  # count comes from the enumerate above: count+1 lines in total
random.shuffle(indexes)
Now iterate over this list of indexes, seek to the proper location, and read one chunk. Since we padded after the linefeed, each chunk contains exactly one real line followed by padding, so we can split on the linefeed and discard the padding:
with open(outputfile, 'rb') as source:
    for idx in indexes:
        source.seek(idx * max_line_len)
        random_line = source.read(max_line_len).decode().split("\n")[0]
        print(random_line)  # or store to another file
I haven't tested this, but it should work if you have enough disk space. Of course, this is very wasteful if you have one very long line while the rest are short.
Upvotes: 0
Reputation: 994
Doing some rough back-of-the-napkin calculations: estimating 120 characters per line times 153 million lines works out to roughly 18.4 GB of data (assuming 1 byte per character; it would be more with multi-byte Unicode, but you get the point). So you would need at least that much RAM to read the full text into memory, which is why you're getting the out-of-memory error on the read.
One approach you could take is to divide the job up into chunks: read in a portion of the file, randomize it, and append it to a new file, writing out and freeing memory as you go. Of course, the problem there is that you would only be randomizing within each specific chunk.
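A minimal sketch of that chunked approach (untested; the function name and chunk size are placeholders you would tune to your available RAM):
import random

def shuffle_in_chunks(inputfile, outputfile, chunk_lines=1_000_000):
    """Shuffle a large file chunk by chunk; lines only move within their own chunk."""
    with open(inputfile, "r") as source, open(outputfile, "w") as target:
        chunk = []
        for line in source:
            chunk.append(line)
            if len(chunk) >= chunk_lines:
                random.shuffle(chunk)
                target.writelines(chunk)
                chunk = []  # drop the chunk so memory is freed before the next one
        if chunk:  # the final, partial chunk
            random.shuffle(chunk)
            target.writelines(chunk)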
There are many approaches you could take here, but you can't get around the fact that you can't read in all of that text at once if you don't have the memory.
Edit
I really like Chad's idea of using h5py and HDF5. It's essentially doing all the shuffling in a file on the hard drive... Kind of like forcing a hard drive swap but with more control. I dig it! It does require having h5py though.
Upvotes: 2
Reputation: 1736
Using h5py, you can port your data file to HDF5 format and then randomize:
https://stackoverflow.com/a/44866734/3841261
You can use random.shuffle(dataset). This takes a little more than 11 minutes for a 30 GB dataset on my laptop with a Core i5 processor, 8 GB of RAM, and a 256 GB SSD.
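A rough sketch of that port (the filenames, dataset name, and batch size are placeholders, and the dataset setup here is my assumption rather than the linked answer's exact code):
import random

import h5py

# Copy each text line into a resizable variable-length string dataset,
# then shuffle that dataset in place on disk.
str_dtype = h5py.special_dtype(vlen=str)

with open("input.txt", "r") as source, h5py.File("lines.h5", "w") as h5:
    dataset = h5.create_dataset("lines", shape=(0,), maxshape=(None,),
                                dtype=str_dtype, chunks=(100_000,))
    buffer = []
    for line in source:
        buffer.append(line.rstrip("\n"))
        if len(buffer) >= 1_000_000:  # flush a batch to disk periodically
            n = dataset.shape[0]
            dataset.resize((n + len(buffer),))
            dataset[n:] = buffer
            buffer = []
    if buffer:  # flush the final partial batch
        n = dataset.shape[0]
        dataset.resize((n + len(buffer),))
        dataset[n:] = buffer

with h5py.File("lines.h5", "r+") as h5:
    random.shuffle(h5["lines"])  # swaps elements on disk, one pair at a time
The shuffle touches the disk for every swap, which is why a fast SSD makes a noticeable difference here.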
Upvotes: 3