Reputation: 31
Question is a bit long so please bear with me.
I'm writing Java code to aggregate flows from a full-day network trace into 84-second bins for each subnet. Currently, I have up to 256 subnets and 1024 bins per subnet. I use this to compute traffic feature statistics such as the number of connections, in/out bytes, and the number of external IP addresses in each window of each subnet. While connections and in/out bytes are simple, getting the number of unique external IP addresses is causing OutOfMemory errors.
To determine the number of unique external IP addresses, I need to store the IP addresses in some data structure such as a hash table, and at the end of the trace I can take the size of that hash table. This means I'll have 1024*256 hash tables, each storing a large number (tens to thousands) of 12-15 byte IP address strings. This quickly blows up and the system goes out of memory (I've tried setting the Java heap size up to 2 GB, to no avail). Can anyone suggest a way to store such a large number of objects efficiently?
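To make this concrete, here's a simplified sketch of the per-bin structure I have now (the names are made up for this question):

import java.util.HashSet;
import java.util.Set;

// One set per (subnet, bin) pair: up to 256 * 1024 of these.
// Each set holds the external IPs seen in that 84-second window as Strings,
// which is what blows up the heap.
private final Set<String>[][] externalIps = new HashSet[256][1024];

public void addFlow(int subnet, int bin, String externalIp) {
    if (externalIps[subnet][bin] == null)
        externalIps[subnet][bin] = new HashSet<>();
    externalIps[subnet][bin].add(externalIp);
}

public int uniqueCount(int subnet, int bin) {
    Set<String> s = externalIps[subnet][bin];
    return s == null ? 0 : s.size();
}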
I tried using a BitSet (converting each IP to an int); however, since the IP addresses are very sparse, it doesn't help the memory situation. As a last resort, I might use the Colt library's sparse matrices with each double storing up to 64 IP addresses, but I wanted to get an opinion in case I'm missing something obvious and can save time writing/debugging such a wrapper.
Side notes: To get an idea of the scale, I see several hundred million flows per trace that I parse and aggregate. I'm using 10 to 20 of the 256 subnets in most cases, but I'd like the solution to scale to all 256 subnets.
Upvotes: 3
Views: 4746
Reputation: 533780
I would have multiple BitSets, e.g.

import java.util.BitSet;

private final BitSet[] ips = new BitSet[256 * 256 * 256];

public void sample(int address) {
    // The top 24 bits select a BitSet; the low 8 bits select a bit within it.
    BitSet bs = ips[address >>> 8];
    if (bs == null)
        ips[address >>> 8] = bs = new BitSet();
    bs.set(address & 0xFF);
}

public int count() {
    int total = 0;
    for (BitSet bs : ips)
        if (bs != null)
            total += bs.cardinality();
    return total;
}
This uses as little as 1 bit per address, depending on how sparse the IP addresses are. Given that many address ranges won't appear at all, memory consumption could be very efficient. It's very hard to test without a realistic dataset. ;)
The worst case memory size is 512 MB, but it should be much less than this for realistic data sets.
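For example, usage could look something like this (assuming the field and methods above sit in a class, here called IpCounter just for the sketch):

IpCounter counter = new IpCounter();
counter.sample(0xCB007107); // 203.0.113.7 as a 32-bit int
counter.sample(0xCB007107); // same address again; the same bit is set, so it is counted once
counter.sample(0xC6336414); // 198.51.100.20
System.out.println(counter.count()); // prints 2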
Upvotes: 0
Reputation: 38842
Update: If you stored all 4 billion IPv4 addresses as a single array, then you could represent time as an individual short.
short[] ipv4 = new short[Integer.MAX_VALUE * 2]; // technically not possible blah blah
That'd be 8 GB with 65K time resolution. Consider it only because it puts an upper bound on memory: any other scheme has to come in under that. If you used a byte instead, it'd be 256-bin time resolution at 337.5 sec per bin, and 4 GB.
Now that only gives you a bit saying you saw at least one packet within that bucket. If you need a count, that blows up the memory again, but with a short you could use 1024 buckets and have 6 bits of resolution left for counting: 64 packets max.
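A sketch of that packing, purely to show the arithmetic (10 bits for the bucket index, 6 bits for a saturating count):

// Pack a 10-bit time bucket index and a 6-bit saturating packet count into one short.
static short pack(int bucket, int count) {
    return (short) (((bucket & 0x3FF) << 6) | Math.min(count, 63));
}

static int bucketOf(short packed) {
    return (packed & 0xFFFF) >>> 6;   // upper 10 bits
}

static int countOf(short packed) {
    return packed & 0x3F;             // lower 6 bits, capped at 63
}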
Now with 100 million unique IPs, that reduces memory by 10x, so in theory you go from 8 GB to 800 MB. By not allocating the entire space you'd think you can save memory, but you still have to store 4 bytes per IP: 400 MB just for the IPs, plus 400 MB for some sort of structure to hold them (100M pointers * 4 bytes), plus 2 bytes for time: roughly 1 GB minimum. By allocating the full space you get to skip storing the IP itself, because your hash is your IP. If you shrink the array you don't have the IP anymore, because it's been hashed away. You could still answer questions about a given IP without storing it, but you can't regurgitate it.
What if you stored a series of subnet masks, rolled up all the IPs underneath each one, and kept your stats per subnet mask? For example, you have 256 subnets, each with its own subnet mask. Your program would take a lower bound for the mask. Say your mask is 209.134.0.0/16 and you use a lower bound of 8: it would create 256 bins for that subnet covering 209.134.0.0-209.134.255.255. You'd repeat the same process for all 256 subnets you have. A lower bound of 8 bits means each bin rolls up 256 consecutive addresses. You can hash any IP address into its bin and keep statistics in memory; however, you can't say anything about a single IP address. If you need more resolution, you just drop the lower bound, say to 4, and now there are more bins.
You only create a bin if at least 1 IP falls within it, so if no IPs show up there you save some space. It's a balancing act: a bound low enough for decent resolution, but high enough to skip creating bins for things you don't need.
Then you could write out a log for each bin and track what occurred in each bin on disk. If you want to answer a question about a single IP, you figure out which bin it belongs in, then open that file and search within it to find the answer. This scheme lets you scale up or down with the size of your data by raising and lowering your bounds. You could also get a performance boost by changing the file structure you write out per bin.
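For example, the bin lookup for the 209.134.0.0/16 case could be as simple as this (just a sketch; it assumes the IP has already been converted to a 32-bit int):

// For a /16 subnet with a lower bound of 8 bits, the third octet selects one of 256 bins.
static int binIndex(int ip, int subnetMaskBits, int lowerBoundBits) {
    int hostBits = 32 - subnetMaskBits;        // 16 host bits for a /16
    int host = ip & ((1 << hostBits) - 1);     // keep only the host portion of the address
    return host >>> lowerBoundBits;            // roll up the lowest 2^lowerBoundBits addresses
}

// binIndex(209.134.17.200 as an int, 16, 8) == 17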
I know sorry for the length! :-)
Upvotes: 2
Reputation: 76908
Not sure why you have 1024*256?
You only need one data structure to hold all the data: use a red-black tree keyed by the IP as a 4-byte integer. This gives you O(log(n)) lookup time, which means a worst case of 32 comparisons to find an IP. Or use a HashMap keyed by Integer.
In each node, keep your 84-second "bin" objects (stored in a linked list, an array, or whatever makes sense for your access pattern) containing the information you want to store. If you only need the aggregate, only store the aggregate. This really would cut down your memory usage.
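A minimal sketch of that layout (type and field names are just for illustration, and the per-bin aggregate here is a single byte count; store whatever fields you actually need):

import java.util.Map;
import java.util.TreeMap;

// One red-black tree for everything: key = external IP as a 4-byte int,
// value = that IP's per-bin aggregates (e.g. 1024 bins of 84 seconds each).
private final Map<Integer, long[]> statsByIp = new TreeMap<>();

public void record(int ip, int bin, long bytes) {
    long[] bins = statsByIp.computeIfAbsent(ip, k -> new long[1024]);
    bins[bin] += bytes;
}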
Edit: I tend to forget about Java ints being signed. That doesn't pose a problem unless you actually want to sort them easily, in which case use a long/Long.
Upvotes: 1