Reputation: 21602
I'm curious how to work with large files in Python.
For example, I have a dataset of ~20 GB on my hard drive (just an array of numbers) and I want to sort this array to get the k smallest values. The dataset can't be loaded into memory (RAM).
I think the algorithm should be: load the dataset in n chunks, find the k smallest values in each chunk, and keep those k values in memory. After processing every chunk we have k*n values, and then we sort them to get the final k smallest values.
But the questions are: in what format should I store the dataset, what is the fastest way to load it from disk (what chunk size should I choose for particular hardware), and can this be done with several threads?
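The chunked approach above can be sketched roughly as follows. This is a minimal illustration, assuming the dataset is a raw binary file of float64 values (the function name, chunk size, and dtype are placeholders you would tune for your data and hardware). Instead of keeping k values per chunk and sorting at the end, it maintains a single max-heap of size k across all chunks, which needs only O(k) memory:

```python
import heapq

import numpy as np


def k_smallest(path, k, chunk_elems=10_000_000, dtype=np.float64):
    """Stream a raw binary file of numbers and return its k smallest values.

    Keeps a max-heap of size k in memory; each chunk is scanned once.
    """
    heap = []  # max-heap via negated values; holds the current k smallest
    with open(path, "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=dtype, count=chunk_elems)
            if chunk.size == 0:
                break
            # Only the chunk's own k smallest values can affect the result,
            # so partially sort the chunk first (O(chunk) instead of O(chunk log chunk))
            candidates = np.partition(chunk, min(k, chunk.size) - 1)[:k]
            for x in candidates:
                if len(heap) < k:
                    heapq.heappush(heap, -x)
                elif -x > heap[0]:
                    heapq.heapreplace(heap, -x)
    return sorted(-v for v in heap)
```

A larger `chunk_elems` means fewer, bigger sequential reads, which is usually what spinning disks prefer; the right value depends on your RAM and storage, so it is worth benchmarking a few sizes.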
Upvotes: 1
Views: 160
Reputation: 18368
You need an external sort instead. If you load everything into memory and sort it there, that is called an internal sort. Databases use external sort for sorting tasks that don't fit in memory.
Maybe the following resources would help you.
Upvotes: 1