Reputation: 4477
Here is my Document structure:
{
"_id" : ObjectId("50dcd7ff4de274a2c4a31df0"),
"seq_name" : "169:D18M6ACXX:1:1111:17898:82486:GTGACA_10",
"raw_seq" : "TTGACCTGAGGAGACGGTGACCAGGGTTCCCTGGCCCCAGTAGTCAACGGGAGTTAGACTTCTCGCACAGTAATAAACAGCCGTGTCCTCGGCTCTCAGGCTGTTCATTTGCAGA",
"seq_aa" : "LQMNSLRAEDTAVYYCARSLTPVDYWGQGTLVTVSSGQ",
"cdr3_seq" : "GCGAGAAGTCTAACTCCCGTTGACTAC",
"cdr3_seq_aa" : "ARSLTPVDY",
"cdr3_seq_len" : 27,
"cdr3_seq_aa_len" : 9,
"vg" : "IGHV3-48*03",
"dg" : "IGHD3-10*02R",
"jg" : "IGHJ4*02",
"donor" : 10
}
I really enjoy the MongoDB aggregation framework, but I'm having trouble with this grouping pipeline, especially since I can't $out to another collection yet. I can already run this multi-stage grouping pipeline:
db.collection.aggregate({$match:{cdr3_seq_aa_len:{$gt:3}}},
{$group:{_id:"$cdr3_seq_aa",other_set:{$addToSet:"$cdr3_seq_aa_len"}}},
{$group:{_id:"$other_set",sum:{$sum:1}}})
This gives me the number of unique cdr3_seq_aa values, grouped by length:
{ id:40, sum:1002031,
id:41, sum:1949402,....
However, the first operation I would like to do is group by donor, so that I first know how many unique cdr3_seq_aa strings there are for each donor. Then I would like to group by length and count how many strings share each length.
Upvotes: 2
Views: 6369
Reputation: 16705
If I understand the question correctly, this is what you're looking for. The key concept is that you can construct compound _id's from multiple fields.
db.collection.aggregate(
  [
    {$match: {cdr3_seq_aa_len: {$gt: 3}}},
    {$group:
      {
        _id: {donor: "$donor", cdr3_seq_aa: "$cdr3_seq_aa"},
        donor_cdr3_seq_aa_count: {$sum: 1},
        cdr3_seq_aa_len: {$first: "$cdr3_seq_aa_len"}
      }
    },
    {$group:
      {
        _id: {donor: "$_id.donor", len: "$cdr3_seq_aa_len"},
        num_strings_with_this_length: {$sum: 1},
        total_doc_count_by_length:
          {$sum: "$donor_cdr3_seq_aa_count"}
      }
    }
  ])
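To make it easier to see what the two $group stages compute, here is a rough plain-JavaScript sketch that mimics the pipeline on a small in-memory array. It is not MongoDB code: the sample documents and the helper name groupByDonorAndLength are made up for illustration, and only the output field names are taken from the pipeline above.

```javascript
// Hypothetical sample documents; only the fields the pipeline reads are kept.
const docs = [
  { donor: 10, cdr3_seq_aa: "ARSLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 10, cdr3_seq_aa: "ARSLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 10, cdr3_seq_aa: "ARGLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 11, cdr3_seq_aa: "ARSLTPVDY", cdr3_seq_aa_len: 9 },
  { donor: 11, cdr3_seq_aa: "AR", cdr3_seq_aa_len: 2 },
];

// Illustrative helper (not a MongoDB API) that mirrors the three stages.
function groupByDonorAndLength(input) {
  // Mirrors $match: keep only strings longer than 3 residues.
  const matched = input.filter((d) => d.cdr3_seq_aa_len > 3);

  // Mirrors the first $group: one entry per (donor, cdr3_seq_aa) pair,
  // counting how many documents share that pair.
  const perString = new Map();
  for (const d of matched) {
    const key = JSON.stringify([d.donor, d.cdr3_seq_aa]);
    const entry = perString.get(key) ||
      { donor: d.donor, len: d.cdr3_seq_aa_len, count: 0 };
    entry.count += 1;
    perString.set(key, entry);
  }

  // Mirrors the second $group: one entry per (donor, length) pair,
  // counting unique strings and summing the underlying document counts.
  const perLength = new Map();
  for (const e of perString.values()) {
    const key = JSON.stringify([e.donor, e.len]);
    const entry = perLength.get(key) || {
      _id: { donor: e.donor, len: e.len },
      num_strings_with_this_length: 0,
      total_doc_count_by_length: 0,
    };
    entry.num_strings_with_this_length += 1;
    entry.total_doc_count_by_length += e.count;
    perLength.set(key, entry);
  }
  return [...perLength.values()];
}

const result = groupByDonorAndLength(docs);
console.log(JSON.stringify(result, null, 2));
```

With this sample data, donor 10 ends up with two unique strings of length 9 backed by three documents, and donor 11 with one unique string of length 9 backed by one document. The compound key is what keeps identical strings from different donors apart.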
Upvotes: 5