Reputation: 579
I am new to MongoDB and am trying to read data from several collections. I want to compute some statistics on GHTorrent, so I am writing a .csv file with the data I'm interested in. The problem is that my script has now been running for some 30 minutes. I'm sure my queries are less efficient than they could be, but I'm not sure how to improve them.
First, I do
    closed_issues = ghdb.issues.find(
        {"state": "closed"},  # query criteria
        {  # projection
            "id": 1,
            "created_at": 1,
            "closed_at": 1,
            "comments": 1,
            "repo": 1,
            "owner": 1,
            "number": 1,
        }
    )
Then, after opening a file and writing headlines, I do
    for issue in closed_issues:
        countMentioned = ghdb.issue_events.find({
            "issue_id": issue['number'],
            "repo": issue['repo'],
            "owner": issue['owner'],
            "event": "mentioned"}).count()
        countSubscribed = ghdb.issue_events.find({
            "issue_id": issue['number'],
            "repo": issue['repo'],
            "owner": issue['owner'],
            "event": "subscribed"}).count()
        countAssigned = ghdb.issue_events.find({
            "issue_id": issue['number'],
            "repo": issue['repo'],
            "owner": issue['owner'],
            "event": "assigned"}).count()
        time_created = parser.parse(issue['created_at'])
        time_closed = parser.parse(issue['closed_at'])
        timediff = time_closed - time_created
        f.write(
            str(issue['id']) + "," +
            str(issue['number']) + "," +
            str(issue['repo']) + "," +
            str(issue['owner']) + "," +
            str(timediff.total_seconds()) + "," +
            str(issue['comments']) + "," +
            str(countMentioned) + "," +
            str(countSubscribed) + "," +
            str(countAssigned) + '\n'
        )
As you can see, each issue triggers three separate finds that share three of their four criteria and differ only in the event. What is the most efficient way to search once for a given combination of issue_id, repo and owner and get a count for each of the three different event values?
Upvotes: 0
Views: 190
Reputation: 27487
The MongoDB aggregation framework is a good fit for queries that produce aggregated stats such as counts - http://docs.mongodb.org/manual/core/aggregation/
I'd start there and experiment with it a bit. For this kind of use case you can usually let the aggregation compute the counts and then wrap a bit of additional code around the result to export the data in the format you need.
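To make that concrete, here is a rough sketch of what one aggregation over issue_events might look like from PyMongo, using the field names from the question. The helper names (build_pipeline, load_event_counts, counts_for_issue) are made up for illustration, and I'm assuming the per-issue counts fit in memory on the client; treat it as a starting point rather than a tuned solution.

```python
EVENTS = ("mentioned", "subscribed", "assigned")


def build_pipeline(events=EVENTS):
    """Build a pipeline that counts each event type per (owner, repo, issue_id)."""
    group = {"_id": {"owner": "$owner", "repo": "$repo", "issue_id": "$issue_id"}}
    for event in events:
        # Conditional sum: add 1 when the document's event matches, else 0.
        group[event] = {"$sum": {"$cond": [{"$eq": ["$event", event]}, 1, 0]}}
    return [
        # Drop event types we do not report before grouping.
        {"$match": {"event": {"$in": list(events)}}},
        {"$group": group},
    ]


def load_event_counts(issue_events, events=EVENTS):
    """Run the aggregation once and index the results by (owner, repo, issue_id).

    `issue_events` is assumed to be a PyMongo collection such as ghdb.issue_events.
    allowDiskUse lets the server spill large groupings to disk.
    """
    counts = {}
    for doc in issue_events.aggregate(build_pipeline(events), allowDiskUse=True):
        key = (doc["_id"]["owner"], doc["_id"]["repo"], doc["_id"]["issue_id"])
        counts[key] = {event: doc[event] for event in events}
    return counts


def counts_for_issue(counts, issue, events=EVENTS):
    """Look up the counts for one issue; issues without events get zeros."""
    key = (issue["owner"], issue["repo"], issue["number"])
    return counts.get(key, {event: 0 for event in events})
```

With something like this, the loop over closed_issues would call load_event_counts(ghdb.issue_events) once before it starts and then counts_for_issue(counts, issue) per issue, instead of issuing three find(...).count() queries each time. If the result is too large to hold in a dict, the same pipeline could be narrowed with extra $match criteria (for example one owner at a time).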
Upvotes: 1