Reputation: 399
I have a very large text file (10+ GB) which I want to read for some data mining techniques.
To do that, I use parallel techniques with MPI so that many processes can access the same file together.
In fact, I want each process to read N lines. Since the file is not structured (the same number of fields per line, but each field can contain a different number of characters), I have to parse the file, and that is not parallel and takes a lot of time.
Is there any way to jump directly to a specific line number without parsing and counting the lines?
Thank you for your help.
Upvotes: 20
Views: 1404
Reputation: 49283
A few other options beyond what has been mentioned here that will not require scanning the whole file:
Make a master process that pushes lines via pipes/fifos to child processes that do the actual processing. This might be a bit slower, but if, say, 90% of the time spent in the subprocesses is the actual crunching of the text, it should be OK (sketched after this list).
A stupid but effective trick: say you have N processes, and you can tell each process its "serial number" via argv or something, e.g. processor -serial_number [1|2|3...N] -num_procs N. They can all read the same data, but each one processes only the lines where lineno % num_procs == serial_number. It's a bit less efficient because they will all read the entire data, but if the work done on every N-th line is what consumes most of the time, you should be fine (also sketched below).
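For the master/worker option, a minimal sketch follows. The answer suggests pipes/fifos; this adaptation uses MPI point-to-point messages instead, purely because the question already runs under MPI, so that swap, the file name big.txt, and process_line() are all assumptions (it also assumes at least two ranks). Rank 0 reads lines and hands them out round-robin; the other ranks receive and crunch them.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE 65536
#define TAG_LINE 1
#define TAG_DONE 2

static void process_line(const char *line) { (void)line; /* real work here */ }

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {                        /* master: read and distribute */
        FILE *fp = fopen("big.txt", "r");   /* "big.txt" is a placeholder */
        char line[MAX_LINE];
        int next = 1;                       /* next worker to feed */
        while (fp && fgets(line, sizeof line, fp)) {
            MPI_Send(line, (int)strlen(line) + 1, MPI_CHAR,
                     next, TAG_LINE, MPI_COMM_WORLD);
            next = next % (nprocs - 1) + 1; /* round-robin over ranks 1..N-1 */
        }
        if (fp) fclose(fp);
        for (int w = 1; w < nprocs; ++w)    /* tell every worker to stop */
            MPI_Send("", 1, MPI_CHAR, w, TAG_DONE, MPI_COMM_WORLD);
    } else {                                /* worker: receive and process */
        char line[MAX_LINE];
        MPI_Status st;
        for (;;) {
            MPI_Recv(line, MAX_LINE, MPI_CHAR, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_DONE) break;
            process_line(line);
        }
    }
    MPI_Finalize();
    return 0;
}
```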
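And for the modulo trick, a sketch along these lines, where process_line() stands for the expensive work and both line numbers and serial numbers count from 0:

```c
#include <stdio.h>

void process_line(const char *line);       /* the expensive crunching */

/* Every process reads the whole file sequentially, but only works on the
 * lines whose number matches its own serial number. */
void run(const char *path, long serial_number, long num_procs)
{
    FILE *fp = fopen(path, "r");
    if (!fp)
        return;

    char line[65536];
    long lineno = 0;
    while (fgets(line, sizeof line, fp)) {
        if (lineno % num_procs == serial_number)
            process_line(line);            /* every num_procs-th line only */
        ++lineno;
    }
    fclose(fp);
}
```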
Upvotes: 10
Reputation: 13202
No, there isn't: until you read through your unknown data, nobody will know how many newline characters there are. The complexity of this problem is O(n), meaning you'll have to read the whole file at least once. After that you might want to build an index table recording where the newline characters are in your file: it can be shared by all processes, and with fseek you can dramatically speed up further access (a sketch follows below).
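A minimal sketch of that one-time indexing pass, assuming the offsets fit in a caller-supplied in-memory array (for a 10 GB file you would also want 64-bit file offsets, e.g. a 64-bit build or fseeko/ftello):

```c
#include <stdio.h>

/* Scan the file once, recording the byte offset at which each line starts
 * in offsets[] (up to capacity entries).  Returns the number of lines seen.
 * Note: a trailing '\n' produces one final, empty entry. */
long build_index(FILE *fp, long *offsets, long capacity)
{
    long count = 0, pos = 0;
    int c;

    if (capacity > 0)
        offsets[count++] = 0;              /* line 0 starts at offset 0 */
    while ((c = fgetc(fp)) != EOF) {
        ++pos;
        if (c == '\n' && count < capacity)
            offsets[count++] = pos;        /* next line begins after '\n' */
    }
    return count;
}

/* Jumping to line k later is then just: fseek(fp, offsets[k], SEEK_SET); */
```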
Upvotes: 4
Reputation: 206909
If your file isn't otherwise indexed, there is no direct way.
Indexing it might be worth it (scan it once to find all the line endings, and store the offsets of each line or chunk of lines). If you need to process the file multiple times, and it does not change, the cost of indexing it could be offset by the ease of using the index for further runs.
Otherwise, if you don't need all the jobs to have exactly the same number of lines/items, you could just fudge it.
Seek to a given offset (say 1G), and look for the closest line separator. Repeat at offset 2G, etc. until you've found enough break points.
You can then fire off your parallel tasks on each of the chunks you've identified.
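A sketch of that chunking approach under MPI might look like the following. The file name, process_line(), and the fixed line buffer are assumptions, and for a 10 GB file you would want 64-bit file offsets (a 64-bit build or fseeko/ftello). Each rank seeks to its nominal byte offset, advances to the next line boundary, and then reads lines until it crosses the start of the next rank's chunk.

```c
#include <mpi.h>
#include <stdio.h>

void process_line(const char *line);        /* the real data-mining work */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    FILE *fp = fopen("big.txt", "r");       /* "big.txt" is a placeholder */
    if (!fp)
        MPI_Abort(MPI_COMM_WORLD, 1);

    fseek(fp, 0, SEEK_END);
    long size  = ftell(fp);
    long start = size * (long)rank / nprocs;        /* this rank's chunk */
    long end   = size * (long)(rank + 1) / nprocs;

    if (start > 0) {                        /* step back one byte and eat  */
        fseek(fp, start - 1, SEEK_SET);     /* up to the next '\n', so we  */
        int c;                              /* begin on a line boundary    */
        while ((c = fgetc(fp)) != EOF && c != '\n')
            ;
    } else {
        fseek(fp, 0, SEEK_SET);
    }

    char line[65536];
    while (ftell(fp) < end && fgets(line, sizeof line, fp))
        process_line(line);                 /* a line that starts before   */
                                            /* 'end' is finished even if   */
                                            /* it runs past the boundary   */
    fclose(fp);
    MPI_Finalize();
    return 0;
}
```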
Upvotes: 21