I am looking for a design pattern that handles large data sets over the internet and periodically updates those objects. I am developing an application that will display thousands of records in the UI at one time. Additionally, various properties on these objects are quite transient and need to be updated on the client to keep the user aware of the changing state of these records in the system. I have a few ideas on how to approach this problem, but figured there might be a design pattern (or patterns) out there that handles this type of scenario.

Limitations:

The client side is being written in Silverlight.
The objects themselves are not very big (about 15 value-type and string properties), but querying for all the data is expensive. The 15 or so properties contain data from various sources; no clever join statement or indexing is going to speed up the query. I am thinking of populating only a subset of the properties on initial load and then filling in the more expensive details as the user zooms in on a given grouping of objects. Think Google Maps, but instead of streets and buildings it is showing the objects.
I will be able to limit the portion of the thousands of objects that are being updated. However, I will need the user to be able to "zoom out" of a context that allows granular updating to one that shows all the thousands of objects. I imagine that updating will be disabled again for objects when they leave a sufficient zoom context.

Ideas on how to tackle all or part of this problem? Like I mentioned, I am considering a few ideas already, but nothing I have put together so far gives me a good feeling about the success of this project.

Edit: I think the difficult parts really boil down to two things, for which I may need two distinct patterns/practices/strategies:

Loading a large number of records over the internet (~5k).
Keeping a subset of these objects (~500) up to date over the internet.

There are several design patterns that can be used for everything else.

Edit 2: Thanks for the links on various "push" implementations in Silverlight. I could swear sockets had been taken out of Silverlight, but I found a Silverlight 3 reference based on an answer below. This really wasn't a huge problem for me anyway and something I hadn't spent much time researching, so I am editing that out of the original text. Whether updates come down in polls or via push, the general design problems are still there. It's good to know I have options.

Edit 3: Follow-up on push technologies. As I suspected, the Silverlight WCF duplex implementation is comet-like push. This won't scale, and there are numerous articles about how it doesn't in the real world. The sockets implementation in Silverlight is crippled in several ways. It looks like it is going to be useless in our scenario, since the web server may sit behind any given client firewall that won't allow non-standard ports, and Silverlight sockets won't connect on 80, 443, etc. I am still thinking through using the WCF duplex approach in some limited way, but it looks like polling is going to be the answer.

Edit 4: Found a pattern to solve half my problem.

I found this pattern (PDF), which illustrates the use of an iterator pattern to retrieve pages of data from the server and present them as a simple iterator. In .NET land I imagine this would be implemented as IEnumerable (the sample code is in Java and Oracle SQL). Of particular interest to me was the asynchronous page prefetching, basically buffering the result set client-side. With 5k objects everything won't fit on the screen at once, so I can use a strategy of not getting everything at once yet hide that implementation detail from the UI. The core objects the app will be retrieving are in a database, then other look-ups are required to fully populate these objects. This methodology seems like a good approach to get some of the data out to the client fast.

I am now thinking of using this pattern plus some sort of proxy object pattern that listens for deltas to the result set and updates objects accordingly. There are a couple of strategies one could take here. I could load all the data upfront, then send deltas of changes (which will probably need some additional code in the subsystems to provide notification of changes). This might be my first approach. I am still looking. Thanks for all the ideas so far.
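
A minimal sketch of the paged-iterator idea from Edit 4, written in Python rather than .NET to keep it short. It is not the code from the linked PDF. `paged_iterator`, `fetch_page`, `fake_fetch` and `RECORDS` are hypothetical names, and the "server" is an in-memory list so the sketch runs on its own. The caller sees a plain iterator, while the next page is requested in the background as soon as the current one arrives.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def paged_iterator(fetch_page: Callable[[int, int], List[dict]],
                   page_size: int) -> Iterator[dict]:
    """Yield records one at a time while pages are fetched behind the scenes.

    fetch_page(offset, limit) is a hypothetical stand-in for the web-service
    call. A page shorter than `limit` is taken to mean the data is exhausted.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        offset = 0
        pending = pool.submit(fetch_page, offset, page_size)
        while pending is not None:
            page = pending.result()  # wait for the page requested earlier
            offset += len(page)
            if len(page) == page_size:
                # Prefetch the next page while the caller consumes this one.
                pending = pool.submit(fetch_page, offset, page_size)
            else:
                pending = None  # short page: nothing left on the server
            yield from page


# In-memory stand-in for the server (assumption, for illustration only).
RECORDS = [{"id": i} for i in range(25)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return RECORDS[offset:offset + limit]


if __name__ == "__main__":
    for record in paged_iterator(fake_fetch, page_size=10):
        print(record["id"])
```

The UI code only ever loops over the iterator, so the page size and the prefetching can change without touching it.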


Is there a design pattern for dealing with large datasets over the internet?

Answers (8)

Jason Jackson

Reputation: 17260

I up-voted a couple of good answers, but came up with a solution with some changes to the back-end data and a new way of retrieving the data from Silverlight. Here is what is being done to address this:

I am using beans to represent the large data graph. This removed a lot of transmission XML. I am only concerned with a subset of the data anyway, although it's a rather significant subset. By flattening the data into a bean I think I have cut my serialized object size to about 20-25% of the original object graph.
Almost all data on the back end will now have a field for the last time it was modified. I was able to get this for all the big data. There are a few pieces of data that won't have this, but the real problems of query performance and data aggregation were solved with this. As a general solution for others, it looks like this is rather simple to implement in a number of DBMSs.
I am writing new APIs to retrieve data that has been updated after a provided DateTime. This allows me to query only for new and changed objects from the back-end system (the web service calls these APIs, and the Silverlight client calls the web service).
Aggregate changes in the web service and detect whether a portion of the data graph has changed. For simplicity I just send the entire data graph if anything has changed. This was actually the hardest part to figure out. A part of the data graph could have a new updated time even though the core object of the graph has not been updated. I ended up having to write APIs to look for changes to the sub-objects, and then APIs to find the root objects based on those sub-objects (if they had been changed). An object graph can be returned with a root object (and actually much of the object graph) that has not been updated since the last poll. The web service logic queries on small numbers of changes, so even though the queries are not cheap individually, they will potentially only run a few times per poll. Even in very large installations of our product, this query loop will only run 10 or 20 times per polling cycle (see my polling solution below). While our systems are very dynamic, not that much changes in 30 seconds. The web service call that handles all of this reacts the same to an initial load call as it does to a poll. All it is concerned with is retrieving data newer than a given time.
I wrote a collection that inherits from ObservableCollection and handles the querying and polling. The client code using this collection provides a delegate that queries the data. The data is returned asynchronously, and in pages. I haven't settled on a page size. The collection keeps re-querying for pages until the server returns a page that is smaller than the max page size. The collection is also provided information on how to determine the latest date of the newest object in the collection. It polls periodically for updates that are newer than the newest item in the collection. In reality this "latest date" is actually an object containing several dates of various parts of the original object graph. If an item returned from the server already exists in the collection, the item in the collection is updated with the returned data. I did this instead of inserting the new item and removing the old because it works in more databound situations.
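
The collection described above can be sketched roughly as follows. This is a simplified illustration in Python, not the actual Silverlight code: `Record`, `PollingCollection`, `fake_query` and `SERVER` are hypothetical names, the query is synchronous, and a single timestamp stands in for the "latest date" object that really holds several dates. It shows the three pieces that matter: paging until a short page comes back, tracking the newest modification time, and updating existing items in place.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass
class Record:
    id: int
    status: str
    modified: datetime


class PollingCollection:
    """Keeps local records current through one 'changed since' query."""

    def __init__(self, query: Callable[[datetime, int, int], List[Record]],
                 page_size: int):
        self._query = query          # hypothetical: query(since, offset, limit)
        self._page_size = page_size
        self.items: Dict[int, Record] = {}
        self.latest = datetime.min   # newest modification time seen so far

    def poll(self) -> int:
        """Fetch everything newer than `latest`; the first call is the initial load."""
        since, offset, changed = self.latest, 0, 0
        while True:
            page = self._query(since, offset, self._page_size)
            for incoming in page:
                self._merge(incoming)
                changed += 1
            offset += len(page)
            if len(page) < self._page_size:  # short page: server has no more
                return changed

    def _merge(self, incoming: Record) -> None:
        existing = self.items.get(incoming.id)
        if existing is None:
            self.items[incoming.id] = incoming
        else:
            # Update in place so anything bound to `existing` keeps its reference.
            existing.status = incoming.status
            existing.modified = incoming.modified
        self.latest = max(self.latest, incoming.modified)


# In-memory stand-in for the web service (assumption, for illustration only).
SERVER = [
    Record(1, "idle", datetime(2009, 1, 1, 12, 0, 0)),
    Record(2, "idle", datetime(2009, 1, 1, 12, 0, 1)),
    Record(3, "busy", datetime(2009, 1, 1, 12, 0, 2)),
]


def fake_query(since: datetime, offset: int, limit: int) -> List[Record]:
    rows = sorted((r for r in SERVER if r.modified > since),
                  key=lambda r: r.modified)
    # Return copies, as a real service would hand back freshly deserialized objects.
    return [Record(r.id, r.status, r.modified) for r in rows[offset:offset + limit]]
```

Calling `poll()` on a timer gives the polling loop. The same call with an empty collection performs the initial load, which is the "one mechanism for both cases" property described above.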

This pattern could be improved. I could send only deltas to Silverlight for changes. I could still try to use some sort of push technology. But this solution gives me one web service call that can return data for various cases. Polling is also very simple, and there is just one thing doing all of the data retrieval. There aren't a lot of moving parts. This handles object state changes both during the initial data load and during polling, through the same mechanism. This also seems to scale well. The initial call seems to be the most expensive, with subsequent calls running faster and faster. I would assume that this is because the data that remains to be retrieved from the back end gets smaller and smaller with each pass.

I still have one question about my implementation of this that I have posted here.

Thanks for all of the suggestions. While I didn't heed all of the advice, several ideas either directly helped me or got my mind thinking down a different path on how to get this working.

Upvotes: 0

Doug L.

Reputation: 2716

I wonder if you could reduce the amount of data going to the client screen in the first place? You can't see 5,000 points of data all at once anyway. And if you need to scroll to look for the important stuff, consider filtering out the non-important stuff to begin with. Consider some UI designs (dashboard and gauge type stuff) so that the user only sees the trouble spots. Then they can drill in and take action as required.

I know you can't reveal details and I've made a ton of assumptions and this is not a direct technical answer to your question - but maybe rethinking the necessary data feed would help push you in a more efficient direction for both the back-end and the front-end.

Upvotes: 2

PL.

Reputation: 2195

If I understand correctly, there are really two problems here:

The state of the system is represented by data coming from multiple data sources. As a result, querying for the state is expensive.
The amount of data that describes the state of the system is large. As a result, querying all the data that describes the state is expensive.

Standard patterns for solving these problems are to introduce a middle tier and use deltas to update the state. E.g.:

Clearly, you don't want your Silverlight clients talking directly to the backend systems. Not only might that be impossible, it is also very inefficient, since every client could ask the same data source about its state. The standard way to avoid this is to introduce a middle tier that aggregates the data coming from all backend data sources and provides a common interface to the clients. As a result, the backend data sources are polled only as often as needed (this can be configured per data source in the middle tier), and clients don't have to deal with the specifics of those backend data sources. Plus, you can implement indexing of the data in the middle tier, based on the queries most common for the clients.
Assuming that every record has an ID, the client should request only deltas since the last update. One of the many patterns is to use a timestamp. E.g. when the client initializes, it requests the state of the system, and the middle tier sends that state along with a timestamp. When the client needs to update certain records, it provides their IDs in the request together with the timestamp of the last update. The middle tier then sends only the changes since that timestamp, and only for the requested IDs. If an object has 15 properties and only 3 of them have changed since the last timestamp, the update will contain only the values of those 3 properties.
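
As a rough illustration of the timestamp-and-delta exchange, here is a minimal Python sketch of such a middle tier. The class and method names (`MiddleTier`, `write`, `snapshot`, `deltas`) are made up for this example, and a logical counter is used in place of a wall-clock timestamp to keep the behaviour deterministic. That counter is an assumption of the sketch, not a requirement of the pattern.

```python
from typing import Dict, Iterable, Tuple


class MiddleTier:
    """Aggregated state with a per-property 'last changed' stamp."""

    def __init__(self) -> None:
        self._clock = 0  # logical timestamp, incremented on every write
        # record id -> property name -> (value, stamp of last change)
        self._state: Dict[int, Dict[str, Tuple[object, int]]] = {}

    def write(self, record_id: int, **props: object) -> None:
        """Called when a backend source reports the current values of a record."""
        self._clock += 1
        record = self._state.setdefault(record_id, {})
        for name, value in props.items():
            # Only restamp properties whose value actually changed.
            if name not in record or record[name][0] != value:
                record[name] = (value, self._clock)

    def snapshot(self) -> Tuple[Dict[int, Dict[str, object]], int]:
        """Full state plus the stamp the client should remember."""
        full = {rid: {name: value for name, (value, _) in rec.items()}
                for rid, rec in self._state.items()}
        return full, self._clock

    def deltas(self, ids: Iterable[int],
               since: int) -> Tuple[Dict[int, Dict[str, object]], int]:
        """Changed properties only, for the requested ids, since the given stamp."""
        changes: Dict[int, Dict[str, object]] = {}
        for rid in ids:
            changed = {name: value
                       for name, (value, stamp) in self._state.get(rid, {}).items()
                       if stamp > since}
            if changed:
                changes[rid] = changed
        return changes, self._clock


if __name__ == "__main__":
    tier = MiddleTier()
    tier.write(1, name="pump", status="ok", load=10)
    state, stamp = tier.snapshot()           # initial load
    tier.write(1, name="pump", status="alarm", load=10)
    print(tier.deltas([1], stamp))           # only the changed property comes back
```

The client keeps the stamp returned by each call and passes it back on the next request, so unchanged records and unchanged properties never cross the wire.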

As for push vs. poll - push is not automatically the best solution. It really comes down to the trade-off between how often the client needs to be updated and the amount of traffic between the client and the middle tier. E.g. if state changes are frequent but sparse (affecting only a few properties at a time), and there is no requirement to update the client's state immediately, the client might prefer to let changes accumulate rather than receive every single update, so polling would be preferable.

Upvotes: 1

Buzz

Reputation: 295

Two possible solutions:

1) compress your collection and decompress it after transfer.

2) Use Flyweight + proxy pattern.
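
A minimal sketch of the first suggestion, assuming the collection can be serialized as JSON and compressed with gzip. Both choices are assumptions of this example, as are the `pack` and `unpack` names. In practice HTTP-level compression on the web server may achieve the same result without any application code.

```python
import gzip
import json
from typing import List


def pack(records: List[dict]) -> bytes:
    """Serialize the collection and compress it before transfer."""
    return gzip.compress(json.dumps(records).encode("utf-8"))


def unpack(payload: bytes) -> List[dict]:
    """Decompress and deserialize on the receiving side."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))


if __name__ == "__main__":
    # Made-up records, roughly the size of the collection in the question.
    records = [{"id": i, "status": "ok", "site": "north"} for i in range(5000)]
    raw = json.dumps(records).encode("utf-8")
    print(len(raw), len(pack(records)))
```

Records that share many repeated property names and values, as in this made-up data, tend to compress well.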

Upvotes: 2

philfreo

Reputation: 43834

In general I think the answer to your question is that there isn't one or more Design Patterns that will really solve your problem. Instead, this is a great example of a decently large-scale application which just needs a lot of planning and design work.

When you're designing, I think you'll come across some DPs that may help you on a small level, but the details of how this giant thing should work is more of a general (and interesting) design problem.

Perhaps clarifying your questions slightly may help lead people to giving advice on the overall design for this system. Also, once you've put some design/effort into coming up with a high-level design of how this should work, you could ask for critiques/suggestions. It's hard for someone to come up with this completely as an answer to a StackOverflow question. :)

Upvotes: 1

FrenchData

Reputation: 630

I found an article that seems to explain how to create sockets in Silverlight 2: "Silverlight 2 and System.Net.Sockets.Socket".

I have not read it very deeply (it is a little too late for me to do that), but it seems that it could be usable in your case. The main limitation I have seen is that your Silverlight application can only connect to the server it was downloaded from.

Here is a tutorial on Channel 9: "Silverlight using socket".

I hope this will help

Upvotes: 1

monksy

Reputation: 14234

The proxy design pattern is the pattern that will aid in transferring data from one point to another. The proxy design pattern will allow you to treat remote objects as if they were local.
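
A small sketch of how a proxy could fit the scenario in the question, where cheap properties are loaded up front and expensive details only when needed. It is written in Python for brevity, and `RecordProxy`, `load_details` and `fake_load` are hypothetical names: the loader stands in for whatever remote call fetches the expensive properties.

```python
from typing import Callable, Optional


class RecordProxy:
    """Local stand-in for a remote record; details are fetched on first use."""

    def __init__(self, record_id: int, summary: str,
                 load_details: Callable[[int], dict]):
        self.id = record_id
        self.summary = summary              # cheap data from the initial load
        self._load_details = load_details   # hypothetical remote call
        self._details: Optional[dict] = None

    @property
    def details(self) -> dict:
        if self._details is None:
            # First access: go to the server, then cache the result locally.
            self._details = self._load_details(self.id)
        return self._details

    def invalidate(self) -> None:
        """Drop cached details, e.g. when the record leaves the zoomed-in view."""
        self._details = None


def fake_load(record_id: int) -> dict:
    # Stand-in for the expensive multi-source lookup (assumption).
    return {"id": record_id, "owner": "ops"}


if __name__ == "__main__":
    proxy = RecordProxy(7, "pump", fake_load)
    print(proxy.summary)   # no remote call yet
    print(proxy.details)   # triggers the load once
```

Calling code reads `proxy.details` like a local property and never needs to know whether the data has already been fetched.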

Upvotes: 2

andyp

Reputation: 6269

I think you might be missing something: since Silverlight 3 there's the ability to push data to the client. Here's an article which might be helpful with that.

Upvotes: 0
