Reputation: 148
At the moment I am dealing with large float/double datasets used for calculation. I have a set of files to compare Data A against Data B, and I would like to compute the Euclidean distance / cosine similarity — i.e. each point in Data A iterates through the points in Data B to find its nearest neighbour.
The data is given in a text file - no issues with that. What would be an ideal way to go about storing/reading the information?
I would have to repeat the pass over Data B for every point in Data A. The data is to be stored as floats, each data point may have multiple dimensions, and a file may contain up to about 2 million floats.
Should I go about using:
Upvotes: 1
Views: 556
Reputation: 16069
2M floats is not that much at all; it will be perfectly fine to put them all in a list — one list for A, one for B. If A and B are multidimensional, float[][] is just fine. If you find you are running out of memory, try loading the whole of B first, but only one data point from A at a time.
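A minimal sketch of that approach, with tiny made-up 2-D arrays standing in for the parsed files (the helper names `distSq` and `nearest` are mine, not from the question):

```java
public class NearestNeighbour {

    // Squared Euclidean distance; the sqrt can be skipped when only ranking neighbours.
    static float distSq(float[] a, float[] b) {
        float sum = 0f;
        for (int d = 0; d < a.length; d++) {
            float diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // For one point of A, scan all of B and return the index of the nearest point.
    static int nearest(float[] point, float[][] dataB) {
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        for (int i = 0; i < dataB.length; i++) {
            float d = distSq(point, dataB[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for the real files.
        float[][] dataA = {{0f, 0f}, {5f, 5f}};
        float[][] dataB = {{1f, 1f}, {4f, 4f}, {10f, 10f}};
        for (float[] p : dataA) {
            System.out.println(nearest(p, dataB)); // prints 0, then 1
        }
    }
}
```

To stream A instead of holding it in memory, replace the `dataA` array with a loop that parses one line of the A file per iteration and calls `nearest` on it.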
Upvotes: 1
Reputation: 198211
The basic solution is the best one: just a float[][]. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
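Since the question also asks how to read the data in, here is one sketch of filling a float[][] from a text file; it assumes one point per line with whitespace-separated values, which may not match the actual file format:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class LoadPoints {

    // Parse a text file (assumed: one point per line, whitespace-separated floats)
    // into a float[][], one row per data point.
    static float[][] load(String path) throws IOException {
        List<float[]> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue; // skip blank lines
                String[] parts = line.split("\\s+");
                float[] row = new float[parts.length];
                for (int i = 0; i < parts.length; i++) {
                    row[i] = Float.parseFloat(parts[i]);
                }
                rows.add(row);
            }
        }
        return rows.toArray(new float[0][]);
    }
}
```

At 2 million floats this is roughly 8 MB of raw data plus per-row array overhead, which comfortably fits in a default JVM heap.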
Upvotes: 1