Reputation: 148
At the moment I am dealing with large float/double datasets used for calculation. I have a set of files to compare Data A against Data B, and I would like to compute the Euclidean distance / cosine similarity — i.e. each point in Data A iterates through the points in Data B to find its nearest neighbour.
The data is given in a text file - no issues with that. What would be an ideal way to go about storing/reading the information?
I would have to repeat the pass over Data B for every point in Data A. The data is to be stored as floats, each data point may have multiple dimensions, and a file may contain up to about 2 million floats.
Should I go about using:
Upvotes: 1
Views: 556
Reputation: 16069
2M floats is not that much at all; it will be perfectly fine to put them all in a list — one list for A, one for B. If A and B are multidimensional, float[][] is just fine. If you find you are running out of memory, try loading the whole of B first, but only one data point from A at a time.
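A minimal sketch of that approach, with tiny made-up 2-D arrays standing in for the parsed files (the helper names `distSq` and `nearest` are mine, not from the question):

```java
public class NearestNeighbour {

    // Squared Euclidean distance; the sqrt can be skipped when only ranking neighbours.
    static float distSq(float[] a, float[] b) {
        float sum = 0f;
        for (int d = 0; d < a.length; d++) {
            float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // For one point of A, scan all of B and return the index of the nearest point.
    static int nearest(float[] point, float[][] dataB) {
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < dataB.length; i++) {
            float d = distSq(point, dataB[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the real files.
        float[][] dataA = {{0f, 0f}, {5f, 5f}};
        float[][] dataB = {{1f, 1f}, {4f, 4f}, {10f, 10f}};
        for (float[] p : dataA) {
            System.out.println(nearest(p, dataB)); // prints 0, then 1
        }
    }
}
```

To stream A instead of holding it in memory, replace the `dataA` array with a loop that parses one line of the A file per iteration and calls `nearest` on it.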
Upvotes: 1
Reputation: 198211
The basic solution is the best one: just a float[][]. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
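Since the question also asks how to read the data in, here is one sketch of filling a float[][] from a text file; it assumes one point per line with whitespace-separated values, which may not match the actual file format:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LoadPoints {

    // Parse a text file (assumed: one point per line, whitespace-separated floats)
    // into a float[][], one row per data point.
    static float[][] load(String path) throws IOException {
        List<float[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue; // skip blank lines
                String[] parts = line.split("\\s+");
                float[] row = new float[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    row[i] = Float.parseFloat(parts[i]);
                }
                rows.add(row);
            }
        }
        return rows.toArray(new float[0][]);
    }
}
```

At 2 million floats this is roughly 8 MB of raw data plus per-row array overhead, which comfortably fits in a default JVM heap.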
Upvotes: 1