Reputation: 4062
I have a collection of ~5 million documents. They look like the following:

fr  to  weight

There are 6000 possible values for fr and to. I wrote the following code to sum the weights for every possible (fr, to) pair, e.g.
1 1 sum(all documents with fr=1 to=1)
1 2 sum(all documents with fr=1 to=2)
...
I used PyMongo to write the script. Here cells is an array of all possible fr/to values, and calls is the collection holding my ~5 million documents.
for _from in cells:  # pair up each cell with each other cell
    for _to in cells:
        s = calls.aggregate([
            {
                '$match': {
                    'fr': _from,
                    'to': _to
                }
            },
            {
                '$group': {'_id': None, 'sum': {'$sum': '$weight'}}
            }
        ])
        if s['result']:
            fr_to_sum = s['result'][0]['sum']
            _id = entropy.insert({
                'to': _to,
                'fr': _from,
                'sum': fr_to_sum
            })
I am using a single machine.
Would another database be better? E.g. some SQL?
UPDATE
The fields fr and to contain integers in the range 1-100000, while weight is a small float. I am trying to build all possible combinations of fr and to and sum the weights for each combination, because there may be duplicate documents with the same fr and to (e.g. fr=1 to=2 weight=0.004 and fr=1 to=2 weight=0.01), in which case I would like my entropy table to contain fr=1 to=2 weight=0.014.
And just as a sample:
>> db.week_calls.count({'fr':10, 'to':102})
>> 4
After calling db.week_calls.getIndexes() I got:
{
    "v" : 1,
    "key" : {
        "_id" : 1
    },
    "ns" : "db.week_calls",
    "name" : "_id_"
}
This seems odd, since I created an index on the whole thing with db.calls.ensure_index([("fr",1), ("to",1)]).
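One thing worth checking: the getIndexes() output above is for db.week_calls, while the index was created with db.calls.ensure_index(...), so the index may have been created on a different collection than the one being queried. A minimal sketch of the equivalent index spec in PyMongo (the client and collection names in the comment are placeholders):

```python
# 1 means ascending order (the same value as pymongo.ASCENDING)
index_spec = [("fr", 1), ("to", 1)]

# With a live connection this would be applied to the collection that is
# actually queried (hypothetical names):
#   client["db"]["week_calls"].create_index(index_spec)
print(index_spec)
```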
Upvotes: 0
Views: 888
Reputation: 151122
The basic form of what you are trying to do, and without iterating, is this:
db.collection.aggregate([
    { "$group": {
        "_id": {
            "fr": "$fr",
            "to": "$to"
        },
        "sum": { "$sum": "$weight" }
    }}
])
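Since the question's script uses PyMongo, the same pipeline can be written there as a plain list of dicts. A sketch, assuming calls is the connected collection from the question:

```python
# the single-$group pipeline from above, as PyMongo data structures
pipeline = [
    {"$group": {
        "_id": {"fr": "$fr", "to": "$to"},
        "sum": {"$sum": "$weight"},
    }}
]

# with a live connection this would run as:
#   results = list(calls.aggregate(pipeline))
print(len(pipeline))
```

This replaces the 6000 x 6000 round trips of the original loop with a single server-side aggregation.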
Now that may well blow up on you, so you might need to do some looping and "range" the values, as in:
db.collection.aggregate([
    { "$match": {
        "fr": { "$gt": 0, "$lte": 10000 },
        "to": { "$gt": 0, "$lte": 10000 }
    }},
    { "$group": {
        "_id": {
            "fr": "$fr",
            "to": "$to"
        },
        "sum": { "$sum": "$weight" }
    }}
])
To segment all of those documents, and then get the next segment:
db.collection.aggregate([
    { "$match": {
        "fr": { "$gt": 10000, "$lte": 20000 },
        "to": { "$gt": 10000, "$lte": 20000 }
    }},
    { "$group": {
        "_id": {
            "fr": "$fr",
            "to": "$to"
        },
        "sum": { "$sum": "$weight" }
    }}
])
Use whichever solution works out best for you.
The general point is to let $group do the work for you, and to break the job into manageable chunks where the results may be too big for one fetch.
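The range-based chunking above can be sketched as a small helper that generates one pipeline per band; the step size of 10000 mirrors the ranges in the examples and is an arbitrary choice:

```python
def segment_pipelines(max_value, step):
    """Yield one $match + $group pipeline per value band,
    mirroring the segmented queries above."""
    for lo in range(0, max_value, step):
        hi = lo + step
        yield [
            {"$match": {
                "fr": {"$gt": lo, "$lte": hi},
                "to": {"$gt": lo, "$lte": hi},
            }},
            {"$group": {
                "_id": {"fr": "$fr", "to": "$to"},
                "sum": {"$sum": "$weight"},
            }},
        ]

pipelines = list(segment_pipelines(100000, 10000))
# each entry can then be passed to collection.aggregate(...)
```

Note that, as in the examples above, each segment only matches documents whose fr and to fall in the same band; covering cross-band pairs as well would mean ranging on fr alone, or iterating over pairs of bands.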
In future releases aggregate will return a cursor.
Upvotes: 1