Walter Williams

Graphing large amounts of data

In a product I work on, there is an iteration loop that can run anywhere from a few hundred to a few million iterations. Each iteration computes a set of statistic variables (double precision), and the number of variables can be up to 1000 (typically 15-50).

As part of the loop, we graph the change in the variables over the iterations, so the X axis is iterations and the Y axis is the variables' values (coded by color):

http://sawtoothsoftware.com/download/temp/walt/graph.jpg

Currently the data are stored in a file containing:
a 4-byte integer for the variable index,
a 4-byte integer for the iteration index,
and an 8-byte double for the value.
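For reference, reading that record layout back is straightforward; here is a minimal Python sketch, assuming little-endian byte order and a hypothetical `read_records` helper (neither is specified in the question):

```python
import struct

# Assumed layout from the description: 4-byte variable index,
# 4-byte iteration index, 8-byte double value (16 bytes per record).
RECORD = struct.Struct("<iid")

def read_records(path):
    """Yield (variable, iteration, value) tuples from the binary log file."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            if len(chunk) < RECORD.size:
                break  # ignore a trailing partial record
            yield RECORD.unpack(chunk)
```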

The total scale of the Y axis changes over time, and it is desired that the graph resize to accommodate the current scale (this can be seen in the picture).

At roughly 5-second intervals, the data are read and plotted onto a bitmap, which is then displayed to the user. We do a few optimizations to avoid repainting the whole thing, but if the number of iterations or the number of variables gets big, we end up with an enormous file that takes longer than 5 seconds to draw.

I'm looking for ideas on how to handle this much data more effectively and quickly if possible.

Upvotes: 2

Views: 2743

Answers (4)

abalakin

Reputation: 827

Why don't you produce a bitmap (or a pixmap format like XPM)? Each column (or row) would correspond to an iteration, and the height (or width, for rows) of a same-colored run would correspond to the variable's value. The XPM format is simpler since it is textual (one character per pixel) and cross-platform.

Upvotes: 0

spoulson

Reputation: 21601

In SQL terms, you should group and aggregate the results. You can't possibly show all 10,000 data points on the graph without scrolling way off the screen. One approach is to group by a time scale (seconds, minutes, etc.) and query AVG(), MAX(), or MIN() to reduce the data points to a smaller set.

MySQL example, group by seconds:

select time_collected, AVG(value)
from Table
group by UNIX_TIMESTAMP(time_collected)

Also consider combining aggregate values and visualizing in a candle stick chart.
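The same idea works in plain code before plotting; here is a rough Python sketch (the `bucket_ohlc` name and bucket size are my own, not from the answer) that folds each bucket of iterations into open/high/low/close values suitable for a candlestick chart:

```python
def bucket_ohlc(points, bucket_size):
    """Aggregate (iteration, value) pairs into per-bucket
    (open, high, low, close) tuples for a candlestick chart."""
    buckets = {}
    for iteration, value in points:
        key = iteration // bucket_size
        if key not in buckets:
            buckets[key] = [value, value, value, value]  # open, high, low, close
        else:
            b = buckets[key]
            b[1] = max(b[1], value)  # high
            b[2] = min(b[2], value)  # low
            b[3] = value             # close (last value seen in the bucket)
    return {k: tuple(v) for k, v in sorted(buckets.items())}
```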

Upvotes: 4

Nir

Reputation: 25369

I see from the graph that you're plotting 10,000 iterations on a few hundred pixels, so just use one in every 100 data points for the graph and ignore the rest. It will look the same to users.
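A minimal sketch of that decimation in Python (`decimate` and the target count are illustrative names, not from the answer):

```python
def decimate(points, target=500):
    """Keep roughly `target` evenly spaced points; a plot a few
    hundred pixels wide cannot show finer detail anyway."""
    step = max(1, len(points) // target)
    return points[::step]
```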

Upvotes: 1

Todd Gamblin

Reputation: 59847

You should ask yourself how valuable it is to display data for every iteration, and what the user really cares about in this data. I think the main thing you need to do here is reduce the amount of data you display to the user.

For example, if the user only cares about the trend, then you could easily get away with evaluating these functions only every so many iterations (instead of every iteration). On the graph above, you could probably get just as informative a plot by drawing only every 100th value on the curve, which would reduce the size of your data set (and the work your drawing algorithm does) by a factor of 100. Obviously, you could adjust this if you happen to need more detail.

To avoid having to recompute data points when you redraw, just keep around the small set of points you've already drawn in memory instead of recomputing or reloading all the data. You can avoid going to disk this way, and you won't be doing nearly as much work getting all those points rendered again.

If you're concerned about things like missing outliers due to sampling error, a simple thing you can do would be to compute the set of sample points based on sliding windows instead of single samples from the original data. You might keep around max, min, mean, median, and possibly compute error bars for the data you display to the user.
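One simple variant of that idea (tumbling windows rather than true sliding windows, and with illustrative names not from the answer) can be sketched in Python:

```python
import statistics

def window_stats(values, window):
    """Summarize each consecutive window of raw values as
    (min, max, mean, median), so plain downsampling cannot
    silently hide outliers."""
    out = []
    for i in range(0, len(values), window):
        w = values[i:i + window]
        out.append((min(w), max(w), statistics.fmean(w), statistics.median(w)))
    return out
```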

If you need to get really aggressive, people have come up with tons of fancy methods for reducing and displaying time series data. For further information, you could check out the Wikipedia article, or look into toolkits like R, which have many of these methods built in already.

Finally, this Stack Overflow question seems relevant, too.

Upvotes: 3
