LordDoskias

Reputation: 3191

Efficient representation of data in a data structure

I have the following format:

SOLEXA3_1:3:5:1473:616/1    gi|7367913151|ref|NC_007367.1|  100.00  23  0   0   27  49  3404561 3404539 1e-5    46.1
SOLEXA3_1:3:5:1473:616/1    gi|73921565|ref|NC_007367.1|    100.00  23  0   0   27  49  3404561 3404539 1e-5    46.1
SOLEXA3_1:3:5:1474:616/1    gi|32140171|ref|NC_007367.1|    100.00  23  0   0   27  49  3404561 3404539 1e-2    46.1
SOLEXA3_1:3:5:1474:616/1    gi|7354921565|ref|NC_007367.1|  100.00  23  0   0   27  49  3404561 3404539 1e-5    46.1
SOLEXA3_1:3:5:1475:616/1    gi|73921565|ref|NC_007367.1|    100.00  23  0   0   27  49  3404561 3404539 1e-5    46.1
SOLEXA3_1:3:5:1475:616/1    gi|73921565|ref|NC_007367.1|    100.00  23  0   0   27  49  3404561 3404539 1e-5    46.1

Basically it is a tab-delimited file. Each input (the first field, e.g. SOLEXA3_1:3:5:1474:616/1) can appear on multiple lines, one per hit against a particular target (32140171 and 7354921565 for the aforementioned example input). What I want to do is build some sort of in-memory representation of all hits for a particular read, together with the quality associated with each hit - the penultimate field, 1e-2 and 1e-5 respectively for the aforementioned two hits. So what I have done is the following:

I have a Map<String, ArrayList<TObjectDoubleMap<String>>>, where the outer String key is the input ID and the ArrayList holds maps from the Trove library, each of which stores a (String, double) pair - the String being the ID of the hit and the double its score. My input file is around 18 million lines, and with a heap of -Xmx12g I'm getting an OutOfMemoryError. Any ideas how I can reduce the memory usage? Bear in mind that the actual scores vary, so I don't think sharing them is feasible.
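For reference, a minimal sketch of that structure as described (assuming Trove's TObjectDoubleHashMap as the concrete map type and one single-entry map per hit; the class and method names are just illustrative):

import gnu.trove.map.TObjectDoubleMap;
import gnu.trove.map.hash.TObjectDoubleHashMap;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

public class ReadHits {
    // read ID -> list of single-entry maps, each holding (hit ID -> score/e-value)
    private final Map<String, ArrayList<TObjectDoubleMap<String>>> hits =
            new HashMap<String, ArrayList<TObjectDoubleMap<String>>>();

    public void addHit(String readId, String hitId, double score) {
        ArrayList<TObjectDoubleMap<String>> list = hits.get(readId);
        if (list == null) {
            list = new ArrayList<TObjectDoubleMap<String>>();
            hits.put(readId, list);
        }
        TObjectDoubleMap<String> hit = new TObjectDoubleHashMap<String>();
        hit.put(hitId, score);
        list.add(hit);
    }
}

Note that in this layout every hit gets its own hash map instance, and each one carries a fixed per-object overhead, which goes some way towards explaining why 18 million lines exhaust a 12 GB heap.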

Upvotes: 0

Views: 172

Answers (3)

Tom Anderson

Reputation: 47183

I think your approach of using a map of lists is basically good, but could be engineered to be more compact.

Firstly, make sure you are canonicalising the read names. That is, there should only be one instance of a string with the characters "SOLEXA3_1:3:5:1473:616/1" in memory; use a map to reduce the names to a canonical instance before using them.

Secondly, are the hit identifiers always integers? If so, store them as such (as longs, since some are evidently too big to fit in ints).

Thirdly, I think you can store the hits and their scores in a very compact structure, as long as you're prepared to do some work, by manually packing both into a long (!). Then, you can just store a sorted array of longs for each input.

Here's how I'd canonicalise the read names:

Map<String, String> names = new HashMap<String, String>();

public String getCanonicalInstanceOfName(String name) {
    // if we've seen this name before, return the stored instance;
    // otherwise remember this instance as the canonical one
    String canonicalName = names.get(name);
    if (canonicalName != null) {
        name = canonicalName;
    }
    else {
        names.put(name, name);
    }
    return name;
}

I can think of at least one smarter way to do this, but this will do for now.

Here's how I'd handle the hit scoring:

public long packHitIDAndScore(String id, float score) {
    long numericID = Long.parseLong(id);
    // the score is a percentage with two decimal places, so hundredths of a percent fit easily in an int
    int scoreAsHundredthsOfAPercent = (int)(score * 100.0);
    // top 50 bits hold the ID, bottom 14 bits hold the score
    long packedIDAndScore = (numericID << 14) + scoreAsHundredthsOfAPercent;
    return packedIDAndScore;
}

The 14 there is because 14 binary bits are big enough to hold values up to 16383, which is enough to contain the range 0 to 10000. Note that you only get 50 bits of storage for the ID, so it would be worth checking that no ID is bigger than 1125899906842623.
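Unpacking is the mirror image; a quick sketch (these helper names are not from the original answer):

public long unpackHitID(long packed) {
    // the ID occupies the top 50 bits
    return packed >>> 14;
}

public float unpackScore(long packed) {
    // the bottom 14 bits hold the score in hundredths of a percent
    return (packed & 0x3FFF) / 100.0f;
}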

Since you're using Trove, you can store the packed longs in a TLongArrayList. Keep the list sorted by using binarySearch to find the appropriate place for each long joining the list, and insert to put it there. To look up a value in the list, use a binarySearch again.
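A rough sketch of that sorted-insert pattern, assuming Trove's TLongArrayList with its binarySearch and insert methods (the wrapper class here is illustrative):

import gnu.trove.list.array.TLongArrayList;

public class HitList {
    private final TLongArrayList hits = new TLongArrayList();

    public void addHit(long packedIDAndScore) {
        // binarySearch returns -(insertionPoint) - 1 when the value is not present
        int index = hits.binarySearch(packedIDAndScore);
        if (index < 0) {
            index = -index - 1;
        }
        hits.insert(index, packedIDAndScore);
    }

    public boolean containsHit(long packedIDAndScore) {
        return hits.binarySearch(packedIDAndScore) >= 0;
    }
}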

Upvotes: 1

Matt MacLean

Reputation: 19658

I would use:

Map<String, ByteArrayOutputStream> map = new HashMap<String, ByteArrayOutputStream>();

Where the key is simply a concatenation of the two fields, and you write the qualities and scores to the ByteArrayOutputStream.

The resulting data structure would look something like:

Key:    "SOLEXA3_1:3:5:1474:616/1_32140171"
Value:  |5|46.1|2|46.1|  //where this is actually just a byte[]

Then when reading the qualities and scores you just use readByte() and readDouble() until you get to the end of the stream.

Of course, doing it this way makes querying a little trickier, but you will save a huge amount on memory allocation.

Ex:

Map<String, ByteArrayOutputStream> map = new HashMap<String, ByteArrayOutputStream>();

for ( String[] fields : rows ) {
    String key = fields[0] + "_" + fields[1];
    byte quality = Byte.parseByte(fields[10].substring(3)); //magnitude of the e-value exponent, e.g. "1e-5" -> 5
    double score = Double.parseDouble(fields[11]);

    if ( !map.containsKey(key) ) {
        map.put(key, new ByteArrayOutputStream());
    }
    DataOutputStream dos = new DataOutputStream(map.get(key));
    dos.writeByte(quality);
    dos.writeDouble(score);
}


//reading
for ( String key : map.keySet() ) {
    ByteArrayOutputStream baos = map.get(key);
    int numHits = baos.size()/9; //1 byte for quality, 8 for score
    DataInputStream din = new DataInputStream(new ByteArrayInputStream(baos.toByteArray()));
    System.out.print( key + " - " + numHits);
    while ( din.available() > 0 ) {
        byte quality = din.readByte();
        double score = din.readDouble();
        System.out.print(" (" + quality + ", " + score + ")");
    }
    System.out.print("\n");
}

Using this method I can read and store ~20 million records in under 1 GB of memory, in around 10 seconds on a MacBook Pro.

Upvotes: 1

lsoliveira

Reputation: 4640

There are several options available for this, but I would use an embedded database such as H2 (though only if a "traditional" database is out of the question). Unless you're doing some very heavy computation on the resulting data, that would be a safe bet.
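As a rough illustration of the embedded-database route (a sketch only, not part of the original answer: the JDBC URL and table layout are assumptions, and the H2 driver must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class H2HitStore {
    public static void main(String[] args) throws Exception {
        // embedded H2 database backed by a local file; "jdbc:h2:mem:hits" keeps it purely in memory
        Connection conn = DriverManager.getConnection("jdbc:h2:./hits");
        conn.createStatement().execute(
            "CREATE TABLE IF NOT EXISTS hits(read_id VARCHAR, hit_id BIGINT, quality DOUBLE, score DOUBLE)");

        PreparedStatement insert = conn.prepareStatement(
            "INSERT INTO hits(read_id, hit_id, quality, score) VALUES (?, ?, ?, ?)");
        insert.setString(1, "SOLEXA3_1:3:5:1474:616/1");
        insert.setLong(2, 32140171L);
        insert.setDouble(3, 1e-2);
        insert.setDouble(4, 46.1);
        insert.executeUpdate();

        conn.close();
    }
}

At 18 million rows you would almost certainly want to batch the inserts (addBatch/executeBatch) rather than issue them one at a time.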

Just a quick list of other options:

  • ETL systems (granted that they require a database)
  • NIO, if the out-of-memory error is happening because of the way the file is being read
  • "Lighter" data structures (tries? FastUtil)

You may even use a combination of them all.

Upvotes: 0
