Reputation: 867
I need to split a very large file (3 GB) ten times in the following way: the first split separates the first 10% of the lines from the rest of the file, the second split separates the second 10% of the lines from the rest, and so on (this is for cross-validation).
I've done this naively by loading the lines of the file into a list, going through the list, and writing each line to the appropriate output file based on its index (a simplified sketch is shown after the example below). This is too slow, since it rewrites 3 GB of data for each split.
Is there a better way to do so?
Note: for my purposes, adding # to the start of a line is effectively the same as deleting it. Would it be smarter to add and remove # at the start of the relevant lines instead of writing new files?
EXAMPLE: if the file is [1,2,3,4,5,6,7,8,9,10] then I want to split it like this:
[1] and [2,3,4,5,6,7,8,9,10]
[2] and [1,3,4,5,6,7,8,9,10]
[3] and [1,2,4,5,6,7,8,9,10]
and so on
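Here is roughly what I'm doing now (simplified; the filenames and fold count are placeholders):

    N_FOLDS = 10

    # load the whole 3 GB file into memory at once
    with open("data.txt") as f:
        lines = f.readlines()

    fold_size = len(lines) // N_FOLDS

    for fold in range(N_FOLDS):
        start, end = fold * fold_size, (fold + 1) * fold_size
        with open(f"test_{fold}.txt", "w") as test, open(f"train_{fold}.txt", "w") as train:
            for i, line in enumerate(lines):
                # every line is rewritten for every fold -> ~30 GB of output in total
                (test if start <= i < end else train).write(line)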
Upvotes: 3
Views: 154
Reputation: 5440
I'd suggest trying to reduce the number of files you write: ten splits of a 3 GB file means writing about 30 GB, and even though that isn't too much for modern disks, it still takes a lot of time to write and process.
For example:
Assuming you want 10% of the lines, not 10% of the bytes, you could build an index of the byte offset at which each line starts, and access the (single, original) text file through that index (see the first sketch below).
You could also convert the original file to a fixed-record file, so that each line occupies the same number of bytes. Then you can jump directly to any line with seek().
Both of these approaches could be 'hidden' behind a file-like object in Python. That way you can access the single file as several 'virtual' files, each exposing only the part (or parts) you want (see the second sketch below).
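A minimal sketch of the index idea (one pass to record where each line starts, then seek() to pull out any 10% slice without copying the data; the filename and fold numbers are just placeholders):

    def build_line_index(path):
        """One pass over the file, recording the byte offset at which each line starts."""
        offsets = []
        with open(path, "rb") as f:
            pos = 0
            for line in f:
                offsets.append(pos)
                pos += len(line)
        return offsets

    def read_lines(path, offsets, start, end):
        """Yield lines start..end-1 (0-based) by seeking into the original file."""
        with open(path, "rb") as f:
            f.seek(offsets[start])
            for _ in range(end - start):
                yield f.readline()

    # e.g. the test slice for fold k out of 10; the training data is everything outside [lo, hi)
    offsets = build_line_index("data.txt")
    n, k = len(offsets), 2
    lo, hi = k * n // 10, (k + 1) * n // 10
    test_lines = list(read_lines("data.txt", offsets, lo, hi))

The files are opened in binary mode so the recorded offsets are real byte positions (text mode could translate newlines and throw the offsets off).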
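And a sketch of the 'virtual file' idea: an object that presents only the training (or test) part of one fold, so downstream code can loop over it as if it were a separate file. It's an iterator rather than a full file object, and the class and parameter names are mine, not a standard API:

    class FoldView:
        """Iterable 'virtual file': yields only the lines belonging to one fold's
        test slice (test=True) or to everything else (test=False), by re-reading
        the single original file each time it is iterated."""

        def __init__(self, path, fold, n_folds, n_lines, test=True):
            self.path, self.fold = path, fold
            self.n_folds, self.n_lines, self.test = n_folds, n_lines, test

        def __iter__(self):
            lo = self.fold * self.n_lines // self.n_folds
            hi = (self.fold + 1) * self.n_lines // self.n_folds
            with open(self.path) as f:
                for i, line in enumerate(f):
                    if (lo <= i < hi) == self.test:
                        yield line

    # n_lines can come from the index above, or a one-off: sum(1 for _ in open("data.txt"))
    train_0 = FoldView("data.txt", fold=0, n_folds=10, n_lines=1_000_000, test=False)
    test_0 = FoldView("data.txt", fold=0, n_folds=10, n_lines=1_000_000, test=True)

Nothing extra is ever written to disk; each view just re-reads the one original file and skips the lines that don't belong to it.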
Upvotes: 1