Reputation: 645
I'm reading in a lot of data from a file. There may be 100 different data objects, each with its own heading, but there can be well over 300,000 values stored in each of these data objects. The values need to be stored in the same order that they are read in. This is the constructor for the data object:
public Data(String heading, ArrayList<Float> values) {
    this.heading = heading;
    this.values = values;
}
What would be the quickest way to store and retrieve these values sequentially in RAM?
Upvotes: 0
Views: 11912
Reputation: 41223
Although in your comments you mention "quickness" without specifying which operation needs to be quick, your main concern seems to be heap memory consumption.
Let's assume 100 groups of 300,000 numbers (you've used words like "may be" and "well over" but this will do as an example).
That's 30,000,000 numbers to store, plus 100 headings and some structural overhead for grouping.
A primitive Java float is 32 bits, that is 4 bytes. So at an absolute minimum, you're going to need 30,000,000 * 4 bytes == 120MB.
An array of primitives - float[30000000] - is just all the values concatenated into a contiguous chunk of memory, so it will consume this theoretical minimum of 120MB -- plus a few bytes of once-per-array overhead that I won't go into detail about here.
A Java Float wrapper object is 12 bytes. When you store an object (rather than a primitive) in an array, the reference itself is 4 bytes. So an array of Float - Float[30000000] - will consume 30,000,000 * (12 + 4) == 480MB.
So, you can cut your memory use by more than half by using primitives rather than wrappers.
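To make the arithmetic concrete, here is a minimal sketch using the example figure of 30,000,000 values (the exact per-object sizes depend on your JVM, so treat the numbers as estimates):
float[] primitives = new float[30_000_000]; // 30,000,000 * 4 bytes: roughly 120MB in one contiguous block
Float[] boxed = new Float[30_000_000];      // the array of references alone is roughly 120MB...
for (int i = 0; i < boxed.length; i++) {
    boxed[i] = (float) i;                   // ...and each element also gets its own Float wrapper, ~480MB in total by the accounting above
}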
An ArrayList is quite a light wrapper around an array of Object, and so has about the same memory costs. The once-per-list overheads are too small to have an impact compared to the elements, at these list sizes. But there are some caveats:
ArrayList can only store Objects, not primitives, so if you choose a List you're stuck with the 12-bytes-per-element overhead of Float.
ArrayList is dynamic, and to achieve this, if you grow the list to be bigger than its backing array, it will allocate a new backing array roughly 1.5 times the size of the old one and copy the existing elements into it. So adding one element to a list whose backing array already holds 30,000,000 means ArrayList.add() will replace the array with one of 45 million elements, even if your List only needs 30,000,001. You can call ArrayList.trimToSize() to drop unneeded capacity and claw some memory back after you've filled the ArrayList.
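For example, a minimal sketch assuming you happen to know a sensible capacity up front (30,000,000 is just the example figure from earlier):
ArrayList<Float> list = new ArrayList<>(30_000_000); // pre-sized backing array, so no grow-and-copy cycles while filling
// ... add the values ...
list.trimToSize();                                   // afterwards, shrink the backing array to the exact element count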
If I were striving to use as little heap memory as possible, I would aim to store my lists of numbers as arrays of primitives:
class Data {
    String header;
    float[] values;
}
... and I would just put these into an ArrayList<Data>.
With this structure, you have O(1) access to arbitrary values, and you can use Arrays.binarySearch() (if the values are sorted) to find by value within a group.
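For instance, a minimal sketch of lookups against that structure (the group index and the 42.0f value are made up for illustration, and Arrays.binarySearch() only gives meaningful results on sorted values):
Data group = groups.get(0);                         // O(1) access to a group in the ArrayList<Data>
float tenth = group.values[9];                      // O(1) access to the 10th value in that group
int pos = Arrays.binarySearch(group.values, 42.0f); // index of 42.0f if present, negative otherwise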
If at all possible, I would find out the size of each group before reading the values, and initialise the array to the right size. If you can, make your input file format facilitate this:
// readLine/isHeader/isValue/parseValue/ParsedHeader/Group are placeholders for your own parsing code
List<Group> groups = new ArrayList<>();
float[] currentArray = null;
int arrayIndex = 0;
String line;
while ((line = readLine()) != null) {
    if (isHeader(line)) {
        ParsedHeader header = new ParsedHeader(line);
        currentArray = new float[header.size()];
        arrayIndex = 0;
        Group currentGroup = new Group(header.name(), currentArray);
        groups.add(currentGroup);
    } else if (isValue(line)) {
        currentArray[arrayIndex++] = parseValue(line);
    }
}
If you can't change the input format, consider making two passes through the file - once to discover group lengths, once again to fill your arrays.
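A two-pass version might look roughly like this - a sketch that reuses the hypothetical readLine/isHeader/isValue helpers from above and assumes a headerName(line) helper; a LinkedHashMap keeps the groups in file order:
// Pass 1: count how many values follow each header
Map<String, Integer> counts = new LinkedHashMap<>();
String currentName = null;
String line;
while ((line = readLine()) != null) {
    if (isHeader(line)) {
        currentName = headerName(line);
        counts.put(currentName, 0);
    } else if (isValue(line)) {
        counts.put(currentName, counts.get(currentName) + 1);
    }
}
// Pass 2: re-open the file and run the same loop as before, but size each
// float[] with counts.get(name) instead of header.size()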
If you have to consume the file in one pass, and the file format can't provide group lengths before groups, then you'll have to do something that allows a "list" to grow arbitrarily. There are several options. One is to consume each group into an ArrayList<Float> and, when the group is complete, convert it into a float[]:
float[] array = new float[list.size()];
int i = 0;
for (Float f : list) {
    array[i++] = f; // auto-unboxes Float to float
}
However, none of this considers your reasons for slurping all these numbers into memory in the first place, nor whether this store meets your needs when it comes to processing the numbers.
You should step back and consider what your actual data processing requirement is, and whether slurping into memory is the best approach.
See whether you can do your processing by storing only a slice of data at a time, rather than storing the whole thing in memory. For example, to calculate max/min/mean, you don't need every number to be in memory -- you just need to keep a running total.
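For example, a running min/max/mean needs only a handful of local variables no matter how many values the file contains - a minimal sketch, again using the hypothetical readLine/isValue/parseValue helpers:
float min = Float.POSITIVE_INFINITY;
float max = Float.NEGATIVE_INFINITY;
double sum = 0;
long count = 0;
String line;
while ((line = readLine()) != null) {
    if (isValue(line)) {
        float v = parseValue(line);
        min = Math.min(min, v);
        max = Math.max(max, v);
        sum += v;
        count++;
    }
}
double mean = sum / count;
Resetting those variables whenever isHeader(line) matches would give you per-group statistics instead of per-file ones.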
Or, consider using a lightweight database library.
Upvotes: 4
Reputation: 304
You could use a red-black BST, which is an extremely efficient way to store and retrieve data. It relies on nodes that link to other nodes, so there's no fixed limit on the input size, as long as the JVM has enough memory.
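If you went this route in Java, you wouldn't have to implement the tree yourself: java.util.TreeMap is backed by a red-black tree. A minimal sketch, keying each value by the position it was read in (the keys and values here are made up for illustration):
TreeMap<Integer, Float> values = new TreeMap<>(); // red-black tree under the hood
values.put(0, 1.5f);                              // O(log n) insert
values.put(1, 2.5f);
values.put(2, 4.0f);
Float third = values.get(2);                      // O(log n) lookup; keys iterate in sorted order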
Upvotes: -2