Reputation: 48
I've been trying to figure this out for a while now, reading lots of SO questions and wading through the docs but to no avail. I hope someone here can point me in the right direction!
Consider the following documents:
{
"_id": "045bdeb40176b33cf07b21cd1fb3949e",
"type": "test_result",
"customer_id": "customer",
"product_id": "product1",
"type_id": "type",
"version_id": "1.0.0",
"timestamp": 1381505909000,
"test_result": "passed",
"serial_nr": "NEP000001"
}
{
"_id": "045bdeb40176b33cf07b21cd1fb3c434",
"type": "measurement_result",
"test_result_id": "045bdeb40176b33cf07b21cd1fb3949e",
"measurement_id": "customer:product1:type:1.0.0:0",
"timestamp": 1381505909000,
"data": 2.5
}
The application contains data of units measured in different ways. One test_result
document is inserted each time a unit is tested. Each unit has a unique serial_nr
. One measurement_result
document is inserted for each measurement made on a unit (typically around 50 measurements per unit). Each measurement for that unit has a unique measurement_id
. When the results are inserted the timestamp
is generated. One unit can be tested more than once.
The views that I'm trying to build are the following:
measurement_id
, but only for the oldest results. If any one unit has been tested more than once, only measurement results from the first test should be included. measurement_id
, but only for the most recent results (same rules as for the oldest results).The goal is to reduce the data
field to calculate statistics like average
, min
, max
and standard deviation
, and it is very valuable to be able to separate the statistics by first/latest time a unit was tested.
I've been trying to use complex keys, very advanced reduces and many other approaches, but I can't seem to isolate the latest/newest results.
This application is still nowhere near production, so any solutions are welcome (even switching database system, I guess). Should I structure the data in any other way? Is this even possible to do? I will, in the long run, have lots of data so it is very important that I can calculate the statistics incrementally.
This question seems to be more or the less the same as mine, but there were never any answers and I've already come as far as he did.
Update #1
For cases #1 and #2 I could probably just map by [serial_nr, timestamp]
and then turn off reduce_limit to allow me to only return the most recent entry. I don't know how that affects performance in the long run though?
For #3 and #4 it's harder. Since I need to calculate grouped by measurement_id
it needs to be the first element in the key array. But then what?
(To make things simpler I will now assume that the measurement_result
documents also have the serial_nr
)
map:
function(doc) {
if (doc.type == 'measurement_result')
emit([doc.measurement_id, doc.serial_nr, doc.timestamp], doc.data)
}
reduce:
_stats
GET
with group_level=1
is my only option here because otherwise I will get separate results for each measurement_id - but I still have not managed to filter out only the most recent or the oldest results, this only gets me all of them. Now I could perhaps write a reduce function to somehow check for duplicate serial_nr
and only return the most recent/oldest, but I can't figure out how.
Hope this clears up the question a bit.
Upvotes: 0
Views: 302
Reputation: 118
I think one mistake you might be making is trying to organize your data in the RDBMS way. If your measurements are honestly only going to be about 50 or so they could easily live in the same document. you only need to be concerned if there is no upper bound to the amount of entry's....I've had thousands in the same doc which I don't recommend.
using couchdb's update handlers your could just create a function that could add values to the array in order....here is a quick reference
basically your update handler has to create the doc if it doesn't exist and add entries to the array. Using your example you could have -measurements and -test-result as easy natural keys. your new doc should look something like this:
{
"_id": "NEP000001-measurements",
"type": "measurement_result",
"test_result_id": "045bdeb40176b33cf07b21cd1fb3949e",
"serial_nr": "NEP000001",
"measurements": [
{
"measurement_id": "customer:product1:type:1.0.0:0",
"timestamp": 1381505909000,
"test_result_id": "045bdeb40176b33cf07b21cd1fb3949e",
"data": 2.5
},
{
"measurement_id": "customer1:product2:type:1.0.0:0",
"timestamp": 1381505909005,
"test_result_id": "045bdeb40176b33cf07b21cd1fb3949e",
"data": 2.7
}
]
}
Couchdb views allows you to split what ends up in the view with what the data looks like in the database.
Anyways after this you could have a view functions that was something like this: 1) first _view/first_measurements
map: function(doc) { if (doc.type == 'measurement_result'){ var first = doc.measurements[0]; emit([first.measurement_id, doc.serial_nr, first.timestamp], first.data) } }
reduce: _stats
2) _view/latest map: function(doc) { if (doc.type == 'measurement_result'){ var last = doc.measurements[doc.measurements.length-1]; emit([last.measurement_id, doc.serial_nr, last.timestamp], last.data) } }
reduce: _stats
//I am a little confused about the difference between measurement_id and the id of the document and what has to be unique so I may can't answer this quite yet, but it sounds like you can use startkey and endkey ranges combined with maybe descending = false...to get what you want..also include_doc=true might come in handy here for retrieving the doc that emitted the value....(or any other doc in the database :-))..anyways hope this helps
Upvotes: 3