Reputation: 33359
I've got some code to convert a large (many gigabytes) XML file into another format.
Among other things, I need to store one or two gigabytes of floats in a hash table (two floats for each entry), with an int as the value's key.
Currently, I'm using NSMutableDictionary and a custom class containing the two floats:
// create the dictionary
NSMutableDictionary *points = [[NSMutableDictionary alloc] init];
// add an entry (the data is read from an XML file using libxml)
int pointId = 213453;
float x = 42.313554;
float y = -21.135213;
MyPoint *point = [[MyPoint alloc] initWithX:x Y:y];
[points setObject:point forKey:[NSNumber numberWithInt:pointId]];
[point release];
// retrieve an entry (this happens later on while parsing the same XML file)
int pointId = 213453;
float x;
float y;
MyPoint *point = [points objectForKey:[NSNumber numberWithInt:pointId]];
x = point.x;
y = point.y;
This data set is consuming about 800MB of RAM with the XML file I'm working with now, and it takes quite a long time to execute. I'd like to have better performance, but even more important I need to get the memory consumption down so I can process even larger XML files.
objc_msg_send is right up there in a profile of the code, as is - [NSNumber numberWithInt:]
, and I'm sure I can get the memory usage down by avoiding objects altogether, but I don't know much about C programming (this project is certainly teaching me!).
How can I replace NSMuableDictionary
, NSNumber
MyPoint
with an efficient C data structure? Without any third party library dependencies?
I'd also like to be able to write this data structure to files on the disk, so I can work with a dataset that doesn't entirely fit into memory, but I can probably live without this capability.
(for those not familiar with Objective-C, the NSMutableDictionary class can only store Obj-C objects, and it the keys must also be objects. NSNumber and MyPoint are dumb container classes to allow NSMutableDictionary to work with float and int values.)
EDIT:
I've tried using CFMutableDictionary to store structs, as per apple's sample code. When the dictionary is empty, it performs great. But as the dictionary grows it gets slower and slower. About 25% through parsing a file (~4 million items in the dictionary) it starts to chug, two orders of magnitude slower than earlier in the file.
NSMutableDictionary doesn't have the same performance issue. Instruments shows a lot of activity applying hashes and comparing the keys of the dictionary (the intEqual()
method below). Comparing an int is fast, so something is very wrong for it to be executing so often.
Here's my code to create the dictionary:
typedef struct {
float lat;
float lon;
} AGPrimitiveCoord;
void agPrimitveCoordRelease(CFAllocatorRef allocator, const void *ptr) {
CFAllocatorDeallocate(allocator, (AGPrimitiveCoord *)ptr);
}
Boolean agPrimitveCoordEqual(const void *ptr1, const void *ptr2) {
AGPrimitiveCoord *p1 = (AGPrimitiveCoord *)ptr1;
AGPrimitiveCoord *p2 = (AGPrimitiveCoord *)ptr2;
return (fabsf(p1->lat - p2->lat) < 0.0000001 && fabsf(p1->lon - p2->lon) < 0.0000001);
}
Boolean intEqual(const void *ptr1, const void *ptr2) {
return (int)ptr1 == (int)ptr2;
}
CFHashCode intHash(const void *ptr) {
return (CFHashCode)((int)ptr);
}
// init storage dictionary
CFDictionaryKeyCallBacks intKeyCallBacks = {0, NULL, NULL, NULL, intEqual, intHash};
CFDictionaryValueCallBacks agPrimitveCoordValueCallBacks = {0, NULL /*agPrimitveCoordRetain*/, agPrimitveCoordRelease, NULL, agPrimitveCoordEqual};
temporaryNodeStore = CFDictionaryCreateMutable(NULL, 0, &intKeyCallBacks, &agPrimitveCoordValueCallBacks);
// add an item to the dictionary
- (void)parserRecordNode:(int)nodeId lat:(float)lat lon:(float)lon
{
AGPrimitiveCoord *coordPtr = (AGPrimitiveCoord *)CFAllocatorAllocate(NULL, sizeof(AGPrimitiveCoord), 0);
coordPtr->lat = lat;
coordPtr->lon = lon;
CFDictionarySetValue(temporaryNodeStore, (void *)nodeId, coordPtr);
}
EDIT 2:
The performance problem was due to the almost useless hashing implementation in Apple's sample code. I got the performance way up by using this:
// hash algorithm from http://burtleburtle.net/bob/hash/integer.html
uint32_t a = abs((int)ptr);
a = (a+0x7ed55d16) + (a<<12);
a = (a^0xc761c23c) ^ (a>>19);
a = (a+0x165667b1) + (a<<5);
a = (a+0xd3a2646c) ^ (a<<9);
a = (a+0xfd7046c5) + (a<<3);
a = (a^0xb55a4f09) ^ (a>>16);
Upvotes: 3
Views: 830
Reputation: 385650
For this sort of thing I would just use C++ containers std::unordered_map
and std::pair
. You can use them in Objective-C++. Just give your files a .mm
extension instead of the usual .m
extension.
In your comment you said you've never done C++ before. In that case, you should either try Kevin Ballard's answer of CFDictionary
, or check out the hcreate
, hdestroy
, and hsearch
functions in the standard library.
Upvotes: 3
Reputation: 23400
Rename your .m file to .mm and switch to using C++:
std::map<int, std::pair<float>> points;
Upvotes: 0
Reputation: 185681
If you want NSMutableDictionary-like behavior but with malloc'd memory, you can drop down to CFDictionary (or in your case, CFMutableDictionary). It's actually the underpinnings of NSMutableDictionary, but it allows some customization, namely you can tell it that you're not storing objects. When you call CFDictionaryCreateMutable()
you give it a struct that describes what sort of values you're handing it (it contains pointers that tell it how to retain, release, describe, hash, and compare your values). So if you want to use a struct containing two floats, and you're happy using malloc'd memory for each struct, you can malloc your struct, populate it, and hand that to the CFDictionary
, and then you can write the callback functions such that they work with your particular struct. The only restriction on the keys and objects you can use CFDictionary
with is they need to fit inside a void *
.
Upvotes: 4