Reputation: 613
I have this dataframe:
date,AA,BB,CC
2018-01-01 00:00:00,45.73,0.0,1
2018-01-01 01:00:00,44.16,0.0,2
2018-01-01 02:00:00,42.24,0.0,3
2018-01-01 03:00:00,39.29,0.0,5
2018-01-01 04:00:00,36.0,0.0,6
2018-01-01 05:00:00,41.99,0.0,7
2018-01-01 06:00:00,42.25,0.0,8
I would like to know whether it is possible to read it with the MPI I/O paradigm.
In particular, I would like to divide the rows according to the number of processors. Suppose you have 4 processors. I would like each processor to read two lines: processor 0 reads lines 1 and 2, processor 1 reads lines 3 and 4, and so on.
I have studied some material. As far as I understand, I would have to compute some sort of offset and write the file as one single line. Another possibility could be to use something related to subgrids.
However, as you can see, each line contains variables of different types.
Could someone give me a clue? What I have found so far about MPI I/O is very theoretical, with no practical examples.
Thanks, Diego
Upvotes: 0
Views: 184
Reputation: 5223
MPI-IO works great for binary data. It is less well suited for text data.
If this were binary data, I would expect a header and an index. Rank 0 could read that header and index and broadcast to everyone where the data resides, and then some algorithmic decomposition of records could happen (e.g. each rank reads N records).
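To make that decomposition concrete, here is a minimal sketch in plain Python with the ranks simulated, so it runs without MPI. It assumes a made-up binary layout (`HEADER`, `RECORD`) with fixed-size records, so no separate index is needed and offsets follow from arithmetic. The names `block_range` and `read_block` are mine, not part of any MPI API. In a real MPI program the seek-and-read step would be a read at an explicit offset, such as `MPI_File_read_at`.

```python
import io
import struct

# Hypothetical layout, invented for this sketch:
# header = record count; record = (hour, AA, BB, CC).
HEADER = struct.Struct("<q")
RECORD = struct.Struct("<qddq")

def block_range(n_records, size, rank):
    """Return (first_record, count) for `rank` using contiguous blocks."""
    base, extra = divmod(n_records, size)
    count = base + (1 if rank < extra else 0)   # first `extra` ranks get one more
    first = rank * base + min(rank, extra)
    return first, count

def read_block(f, size, rank):
    """Read only the records that belong to `rank`."""
    # In the scheme described above, rank 0 would read the header and
    # broadcast it; here every simulated rank reads it for simplicity.
    f.seek(0)
    (n_records,) = HEADER.unpack(f.read(HEADER.size))
    first, count = block_range(n_records, size, rank)
    f.seek(HEADER.size + first * RECORD.size)   # explicit byte offset
    data = f.read(count * RECORD.size)
    return [RECORD.unpack_from(data, i * RECORD.size) for i in range(count)]

# The seven rows from the question, with the date stored as an hour number.
rows = [(0, 45.73, 0.0, 1), (1, 44.16, 0.0, 2), (2, 42.24, 0.0, 3),
        (3, 39.29, 0.0, 5), (4, 36.0, 0.0, 6), (5, 41.99, 0.0, 7),
        (6, 42.25, 0.0, 8)]
buf = io.BytesIO(HEADER.pack(len(rows)) + b"".join(RECORD.pack(*r) for r in rows))

for rank in range(4):                           # simulate 4 processes
    print(rank, read_block(buf, 4, rank))
```

With 7 records and 4 ranks, the first three ranks get two records each and the last rank gets one.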
For an ASCII file like this, you're right to ask: how do you split up the file?
How big are these files? If they are only several megabytes (so not that large), read the data on rank 0 and distribute it from there.
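A rough sketch of the "read on rank 0 and distribute" route, again in plain Python with no MPI calls. The helper `split_rows` is a name I made up; it only prepares one chunk per rank. With mpi4py, rank 0 would then typically hand the chunks out with something like `comm.scatter(chunks, root=0)`, which I have left out so the snippet runs anywhere.

```python
import csv
import io

CSV_TEXT = """\
date,AA,BB,CC
2018-01-01 00:00:00,45.73,0.0,1
2018-01-01 01:00:00,44.16,0.0,2
2018-01-01 02:00:00,42.24,0.0,3
2018-01-01 03:00:00,39.29,0.0,5
2018-01-01 04:00:00,36.0,0.0,6
2018-01-01 05:00:00,41.99,0.0,7
2018-01-01 06:00:00,42.25,0.0,8
"""

def split_rows(rows, size):
    """Split `rows` into `size` contiguous chunks of nearly equal length."""
    base, extra = divmod(len(rows), size)
    chunks, start = [], 0
    for rank in range(size):
        stop = start + base + (1 if rank < extra else 0)
        chunks.append(rows[start:stop])
        start = stop
    return chunks

# What rank 0 would do: parse the whole (small) file, then build one chunk per rank.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
chunks = split_rows(rows, 4)

for rank, chunk in enumerate(chunks):
    print(rank, [row["date"] for row in chunk])
```

The CSV parser takes care of the mixed column types here, which is the main appeal of this route for small files.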
Another approach might be to generate an index, either as part of the dataframe or as a separate binary index. That index would map records to file offsets, and then you can split up the job of reading across all the processes.
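Here is a minimal sketch of the index idea, assuming one record per line and a single header line. `build_index` and `read_records` are hypothetical helper names, and the ranks are simulated in a loop. In an MPI program each rank would use its two offsets for a read at an explicit offset (for example `MPI_File_read_at`) instead of `seek` and `read`.

```python
import io

def build_index(f):
    """Return the byte offset of every data line, plus the end-of-file offset."""
    f.seek(0)
    f.readline()                      # skip the header line
    offsets = []
    while True:
        pos = f.tell()
        offsets.append(pos)           # last entry ends up being the EOF offset
        if not f.readline():
            break
    return offsets

def read_records(f, offsets, size, rank):
    """Read only the lines assigned to `rank`, using the offset index."""
    n = len(offsets) - 1              # number of records
    base, extra = divmod(n, size)
    first = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    f.seek(offsets[first])
    raw = f.read(offsets[first + count] - offsets[first])
    return raw.decode("ascii").splitlines()

data = io.BytesIO(
    b"date,AA,BB,CC\n"
    b"2018-01-01 00:00:00,45.73,0.0,1\n"
    b"2018-01-01 01:00:00,44.16,0.0,2\n"
    b"2018-01-01 02:00:00,42.24,0.0,3\n"
    b"2018-01-01 03:00:00,39.29,0.0,5\n"
    b"2018-01-01 04:00:00,36.0,0.0,6\n"
    b"2018-01-01 05:00:00,41.99,0.0,7\n"
    b"2018-01-01 06:00:00,42.25,0.0,8\n"
)

index = build_index(data)             # built once, e.g. by rank 0 or offline
for rank in range(4):                 # simulate 4 processes
    print(rank, read_records(data, index, 4, rank))
```

Building the index still needs one sequential pass over the file, so it pays off mainly when the same file is read many times.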
Upvotes: 0