Reputation: 5020
We are working with weather data: about ten years of weather station data stored in a database. We have built a REST API that, given a day, a station and a measured variable, returns the data in JSON format. Performance is good enough for small apps with a few queries, but it is not suitable for high traffic.
We are thinking of caching (and pre-caching) the JSON data of each day-station-variable combination. Initially we thought of Redis, but the problem is that our data goes up to 400 GB.
This is where I ask for some help and similar experiences:

- Is it a good idea to dump the data to disk files, where each file contains the JSON result of a day-station-variable query (see the sketch below)?
- Any experiences with EhCache or JCS? Are they suitable for this?
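To make the first point concrete, this is roughly the layout we have in mind; the paths and names are only illustrative, it is just one file per query result:

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JsonDumpSketch {

    // Illustrative only: one file per (day, station, variable) query result,
    // e.g. /data/json/2015-06-01/ST042/temperature.json
    static Path fileFor(String day, String station, String variable) {
        return Paths.get("/data/json", day, station, variable + ".json");
    }

    static void dump(String day, String station, String variable, String json) throws Exception {
        Path target = fileFor(day, station, variable);
        Files.createDirectories(target.getParent());
        Files.write(target, json.getBytes(StandardCharsets.UTF_8));
    }
}
```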
Cheers.
Upvotes: 1
Views: 1167
Reputation: 5723
Caching this amount of data within Java is not the best option. With big heap sizes you get GC stalls or need to tune the GC heavily. You could use EHCache with the BigMemory option; however, this means that on each request the CPU needs to deserialize the Java objects from off-heap memory and generate the JSON again.
So although this is tagged with Java caching, I would rather suggest a non-Java solution.
I think the response for a URL never changes, because you are querying past weather information. So just set proper HTTP caching headers and let a front-end (caching) web server do everything. For front-end caching, nginx and Varnish are in common use.
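As a minimal sketch, assuming your API runs in a servlet container, a filter like the following would mark the historical responses as long-lived (the class name and URL pattern are only examples):

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.annotation.WebFilter;
import javax.servlet.http.HttpServletResponse;

// Marks responses for historical data as long-lived, so a caching
// reverse proxy (nginx, Varnish) and browsers can serve them from cache.
@WebFilter("/api/weather/*")   // URL pattern is just an example
public class ImmutableDataCacheFilter implements Filter {

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        // One year is the conventional upper bound for max-age.
        response.setHeader("Cache-Control", "public, max-age=31536000");
        chain.doFilter(req, res);
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }
}
```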
Another option is pre-producing JSON files and just serving static files. This is not as bad as it seems. File contents are cached very well by the operating system, and on Linux and BSD there is the sendfile system call, so the file content is stuffed directly into the TCP buffers by the operating system. It is also a good idea to pre-produce compressed versions of the files. It is possible to configure the web server to automatically pick up the file with the .gz suffix if compression is listed in the Accept-Encoding header. Since practically all web clients support compression, these will be the files that your OS holds in memory and serves quickly.
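A rough sketch of such a pre-production step (the directory layout is just an example); a web server, for example nginx with its gzip_static module, can then serve the .gz variant directly:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

public class StaticJsonProducer {

    // Writes both <name>.json and <name>.json.gz so the web server can
    // serve the pre-compressed variant when the client accepts gzip.
    static void writeJsonAndGzip(Path jsonFile, String json) throws IOException {
        byte[] bytes = json.getBytes(StandardCharsets.UTF_8);
        Files.createDirectories(jsonFile.getParent());
        Files.write(jsonFile, bytes);

        Path gzFile = jsonFile.resolveSibling(jsonFile.getFileName() + ".gz");
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(gzFile))) {
            out.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        // Example layout: /var/www/weather/<day>/<station>/<variable>.json
        Path file = Paths.get("/var/www/weather/2015-06-01/ST042/temperature.json");
        writeJsonAndGzip(file, "{\"station\":\"ST042\",\"temperature\":[21.3,20.9]}");
    }
}
```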
With just files in a file system, the system memory will be used for caching by the operating system very efficiently and completely. If you put any other storage or processing layer onto the problem, you have many more "knobs" to tune, and I doubt you will get a better result.
Good luck!
Upvotes: 2
Reputation: 3129
My 2 cents for a large data store.
Firstly, a file-based data store is inadequate here. It basically means your workload is disk-IO bound, and it is hard to match the disk-IO optimizations that a commercial DB like Oracle has built in, even if you use some sort of "object-based file structure".

My past experience with caching this kind of data is to use in-memory caching techniques such as Coherence. Basically you build a cluster of servers, each with a large amount of memory (say 48 GB), and you cache all your objects in memory. Think of it as a large hash map with a redundancy factor that you can configure. You can define your key in a custom way.
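The cache configuration itself (Coherence or any other distributed map) is out of scope here; this is just a sketch of what such a custom composite key could look like (field names are assumptions):

```java
import java.io.Serializable;
import java.util.Objects;

// Immutable composite key for the cache: one cached JSON blob per
// (day, station, variable) combination. Distributed caches need the key
// to be Serializable and to have consistent equals/hashCode.
public final class MeasurementKey implements Serializable {

    private final String day;       // e.g. "2015-06-01"
    private final String station;   // e.g. "ST042"
    private final String variable;  // e.g. "temperature"

    public MeasurementKey(String day, String station, String variable) {
        this.day = day;
        this.station = station;
        this.variable = variable;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof MeasurementKey)) return false;
        MeasurementKey k = (MeasurementKey) o;
        return day.equals(k.day) && station.equals(k.station) && variable.equals(k.variable);
    }

    @Override
    public int hashCode() {
        return Objects.hash(day, station, variable);
    }
}
```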
Secondly, it looks obvious that your solution is space bound, and you could shift some of the pressure to the CPU: either compress the JSON, or store binary data and convert it to JSON in real time. That should shrink your data by a large ratio. You need to choose a proper format so that the CPU won't be overloaded, but I guess that is very unlikely.
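As a rough illustration of the space/CPU trade-off: gzipping a JSON payload in Java is cheap, and repetitive data like per-day measurements usually compresses well (the sample payload below is made up):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CompressionRatioDemo {

    // Gzip a byte array in memory and return the compressed bytes.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
            gz.write(data);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Made-up repetitive JSON, similar in shape to hourly measurements for one day.
        StringBuilder json = new StringBuilder("[");
        for (int hour = 0; hour < 24; hour++) {
            if (hour > 0) json.append(',');
            json.append("{\"hour\":").append(hour).append(",\"temperature\":21.5}");
        }
        json.append("]");

        byte[] raw = json.toString().getBytes(StandardCharsets.UTF_8);
        byte[] compressed = gzip(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, ratio=%.2f%n",
                raw.length, compressed.length, (double) raw.length / compressed.length);
    }
}
```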
The above is based on the assumption that the queries are simple, i.e. they look up a single (date, station) combination. If you have other frequent queries, then some supporting data structure such as an index needs to be used.
Upvotes: 2