Reputation: 13002
What is the most efficient way, in terms of cost and scalability, to pull stats on large volumes of data?
Let's take a concrete example: there are 1000 companies, each with 10000+ customers. These companies are all in retail, or, to make it more generic, they are companies in any industry that want to know certain things from their customers.
10 of these companies want to know how well their call centres are doing and send out an email asking customers to rate them 1 - 5; customers click on a link and submit their rating.
20 of these companies, which could include some of the previous 10, want to know something else and also ask for a 1 - 5 rating.
Now if I want to give each of these companies feedback on their average rating, or on where they stack up against the other companies that sent the same questionnaire or had overlapping questions, what would be the best strategy for calculating these stats?
Option 1: Have a special entity just for stats. Every time a customer rates the company for something, increment the stats counters (e.g. increment the number-of-votes counter and the vote total, increment the male / female counters if you're tracking votes by gender, etc.). The problem with this approach is that you'll be doing y extra writes for every data entry (where y is the number of stats reports you want to track), and you're also limited to the stats you decided to track up front. You're also limited to 1 write/s per entity group, as Peter mentioned in his response here: Using Objectify to concurrently write data on GAE. If x is the number of entries and y the number of stats reports you want to pull, you'll be doing x * y writes but only y reads to report on stats.
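To make this concrete, here's roughly what I mean, sketched with Objectify's VoidWork transaction API. The StatsCounter entity, its fields, and the counter-id scheme are just illustrative names, not anything I have in production:

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.VoidWork;
import com.googlecode.objectify.annotation.Entity;
import com.googlecode.objectify.annotation.Id;

// Hypothetical counter entity: one per (company, question) pair.
// Must be registered with ObjectifyService.register(StatsCounter.class) first.
@Entity
public class StatsCounter {
    @Id public String id;       // e.g. "companyId:questionId"
    public long voteCount;
    public long voteTotal;      // average = voteTotal / voteCount
    public long maleVotes;
    public long femaleVotes;

    /** Transactionally applies one vote to the counter: Option 1's extra write. */
    public static void recordVote(final String counterId, final int rating, final boolean male) {
        ofy().transact(new VoidWork() {
            public void vrun() {
                StatsCounter c = ofy().load().type(StatsCounter.class).id(counterId).now();
                if (c == null) {
                    c = new StatsCounter();
                    c.id = counterId;
                }
                c.voteCount++;
                c.voteTotal += rating;
                if (male) c.maleVotes++; else c.femaleVotes++;
                ofy().save().entity(c).now();   // this single entity is the ~1 write/s bottleneck
            }
        });
    }
}
```

The transaction is exactly what serialises writers on the counter's entity group and produces the ~1 write/s ceiling; the usual workaround is to split each counter into N shards and sum the shards at read time, which multiplies the writes further.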
Option 2: Do something like: ofy.query(MyEntity.class).filter("field", v).count(); The pitfall is that you're looking up all those entities: does GAE charge x read operations when a count runs through x entities? Also, if you're potentially running through 20000 entries, won't you hit some sort of limit in terms of time-outs, max reads per query, etc.? Depending on how often I pull stats, this means x reads every time I pull stats, assuming I don't hit one of those limits.
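For comparison, a keys-only version of that query at least avoids fetching the full entities; whether keys-only results are billed as the cheaper "small" datastore operations rather than full reads is my assumption from the pricing model, so a rough sketch:

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

import com.googlecode.objectify.Key;
import java.util.List;

public class CountExample {
    // Keys-only count: fetches keys rather than full entities, which should
    // be billed as "small" operations instead of entity reads (my assumption;
    // verify against current datastore pricing).
    public static int countRatings(Object v) {
        List<Key<MyEntity>> keys = ofy().load()
                .type(MyEntity.class)   // MyEntity and "field" as in the snippet above
                .filter("field", v)
                .keys()
                .list();
        return keys.size();
    }
}
```

Even keys-only, this still walks x results per report, so the time-out concern stands; beyond a few tens of thousands of entities you'd want to page through it with cursors, as in Option 3.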
Option 3: Put an extra property in each feedback entry for every piece of stats you're trying to build. Then have a scheduler run every hour / day / week / ..., use cursors to run through each of the entries, mark the stats column as counted and add its value to a stats entity (see the sketch below). If the number of feedback entries is x and you want to pull y reports on this data, that means x writes to mark the x feedback entries as counted (assuming you do the calculations in memory rather than in a stats entity immediately) and y writes every hour / day / week to store the updated stats values. So for x feedback entries, I'll be doing at least 2 * x writes, but only y reads to read the stats.
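Roughly what I have in mind for the scheduled pass, sketched with Objectify's cursor support. The Feedback entity, its rating and counted fields, and the batch size are all illustrative:

```java
import static com.googlecode.objectify.ObjectifyService.ofy;

import com.google.appengine.api.datastore.Cursor;
import com.google.appengine.api.datastore.QueryResultIterator;
import com.googlecode.objectify.cmd.Query;
import java.util.ArrayList;
import java.util.List;

public class StatsRollupTask {
    private static final int BATCH_SIZE = 500;

    /** Processes one batch of uncounted feedback; returns the cursor to resume from, or null when done. */
    public static String rollupBatch(String websafeCursor) {
        Query<Feedback> query = ofy().load().type(Feedback.class)
                .filter("counted", false)
                .limit(BATCH_SIZE);
        if (websafeCursor != null) {
            query = query.startAt(Cursor.fromWebSafeString(websafeCursor));
        }

        QueryResultIterator<Feedback> it = query.iterator();
        List<Feedback> batch = new ArrayList<>();
        long voteTotal = 0;

        while (it.hasNext()) {
            Feedback f = it.next();
            voteTotal += f.rating;   // accumulate in memory
            f.counted = true;        // mark so it is never counted twice
            batch.add(f);
        }
        if (batch.isEmpty()) {
            return null;             // nothing left; this rollup is complete
        }

        ofy().save().entities(batch).now();   // the x extra writes to flag entries
        // ...add voteTotal and batch.size() into the stats entity's running
        //    totals here (the y writes per run, amortised over the batch)...

        return it.getCursor().toWebSafeString();  // re-enqueue a task starting here
    }
}
```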
All of the above seems yucky. Is there a better way to do it? If not, which of the above is the better way, one that won't break when the volumes are massive and won't dramatically increase costs over what they already are in terms of reads / writes / storage?
Upvotes: 1
Views: 200
Reputation: 80340
GAE is not a good option for analytics, because of its concurrent write limitations and the lack of a good query language.
If you are serious about analytics, you should export data from GAE to BigQuery and do the analytics there. Check out Mache and this blog post.
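To show the kind of aggregation that becomes trivial once the feedback rows are in BigQuery, here is a sketch with the google-cloud-bigquery Java client; the mydataset.feedback table and its columns are made up for illustration:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;

public class FeedbackStats {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Hypothetical table: one row per vote, exported from the datastore.
        String sql = "SELECT company_id, AVG(rating) AS avg_rating, COUNT(*) AS votes "
                   + "FROM `mydataset.feedback` "
                   + "WHERE question_id = @question "
                   + "GROUP BY company_id "
                   + "ORDER BY avg_rating DESC";

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
                .addNamedParameter("question", QueryParameterValue.string("call_centre_rating"))
                .build();

        for (FieldValueList row : bigquery.query(config).iterateAll()) {
            System.out.printf("%s: %.2f over %d votes%n",
                    row.get("company_id").getStringValue(),
                    row.get("avg_rating").getDoubleValue(),
                    row.get("votes").getLongValue());
        }
    }
}
```

That one query answers both the per-company average and the cross-company ranking from the question, with no per-vote counter writes in the datastore at all.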
Upvotes: 2