Roy

Reputation: 867

Best way to split a huge file in Python

I need to split a very large file (3 GB) ten times, as follows: the first split separates the first 10% of the lines from the rest of the file, the second split separates the second 10% of the lines from the rest, and so on. (This is for cross-validation.)

I've done this naively by loading the lines of the file into a list, iterating over the list, and writing each line to the appropriate output file according to its index. This takes too long, since it rewrites all 3 GB of data for every split.
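
Roughly, my current code looks like this (simplified; the file names are placeholders):

    # Naive approach: read everything into memory, then rewrite all the
    # data once per split.
    with open("huge.txt") as f:
        lines = f.readlines()          # the whole 3 GB ends up in this list

    n = len(lines)
    fold_size = n // 10

    for fold in range(10):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        with open("fold_%d.txt" % fold, "w") as small, \
             open("rest_%d.txt" % fold, "w") as rest:
            for i, line in enumerate(lines):
                (small if start <= i < stop else rest).write(line)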

Is there a better way to do so?

Note: adding # to the start of a line effectively deletes it. Would it be smarter to toggle # at the start of the relevant lines instead of rewriting the files?

EXAMPLE: if the file is [1,2,3,4,5,6,7,8,9,10], then I want to split it like this:

[1] and [2,3,4,5,6,7,8,9,10]
[2] and [1,3,4,5,6,7,8,9,10]
[3] and [1,2,4,5,6,7,8,9,10]

and so on

Upvotes: 3

Views: 154

Answers (1)

jcoppens

Reputation: 5440

I'd suggest trying to reduce the number of files. Even though the roughly 30 GB you would write (ten splits of a 3 GB file) isn't too much for modern disks, it still takes a huge amount of effort (and time) to process.

For example:

  • Assuming you want 10% of the lines (not 10% of the size), you could build an index file containing the byte offset at which each line starts, and access the single, original text file through that index (see the sketch after this list).

  • You could also convert the original file to a fixed-record file, so that every text line occupies the same number of bytes. You could then jump to any line directly with seek().
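
A minimal sketch of the index idea (untested; huge.txt is a placeholder, and the index is kept in memory here rather than in a separate index file):

    # Pass 1: record the byte offset at which every line starts.
    offsets = []
    with open("huge.txt", "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    # Random access: jump straight to line `lineno` via the index.
    def read_line(f, offsets, lineno):
        f.seek(offsets[lineno])
        return f.readline()

    with open("huge.txt", "rb") as f:
        tenth = len(offsets) // 10
        print(read_line(f, offsets, tenth))   # first line of the second fold

With fixed-size records (the second option) the index becomes unnecessary, since line i simply starts at byte offset i * record_size.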

Both of these approaches could be 'hidden' behind a file-like object in Python. That way you can access the single file as several 'virtual' files, each exposing just the part (or parts) you want.
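
For instance, a small iterator class along these lines (hypothetical names; it implements only iteration rather than the full file interface) could present each fold and its complement as two 'virtual' files over the one physical file, reusing the offsets list from the sketch above:

    class VirtualSplit:
        """Iterate over a chosen subset of lines of one big file."""

        def __init__(self, path, offsets, selector):
            self.path = path          # the single, original file
            self.offsets = offsets    # byte offset of each line start
            self.selector = selector  # line number -> include this line?

        def __iter__(self):
            with open(self.path, "rb") as f:
                for i, off in enumerate(self.offsets):
                    if self.selector(i):
                        f.seek(off)
                        yield f.readline()

    # Ten cross-validation splits, with no data rewritten at all.
    fold_size = len(offsets) // 10
    for fold in range(10):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = VirtualSplit("huge.txt", offsets,
                            lambda i, a=start, b=stop: a <= i < b)
        train = VirtualSplit("huge.txt", offsets,
                             lambda i, a=start, b=stop: not (a <= i < b))
        # iterate over `test` and `train` wherever line sequences are needed

Seeking before every line is wasteful for the 90% part, which is read almost sequentially; skipping the seek whenever the next selected line immediately follows the previous one is an easy optimization.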

Upvotes: 1
