Reputation: 265
Let’s say you have a movie subscription service with normal and premium memberships.
Here is a sample of data generated by user activity and stored as documents in a collection:
[
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 1
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 1,
"elapsed": 200
},
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 2
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 2,
"elapsed": 500
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 3
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 3,
"elapsed": 10
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 4
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 4,
"elapsed": 100
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 5
},
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 5,
"elapsed": 5
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 5,
"elapsed": 25
}
]
You can see that there are two “eventTypes”:
“sessionInfo” documents that have information common to an entire user session
“mediaPlay” documents that store how many seconds of a movie was viewed.
(Each “mediaPlay” event contains the sessionGroupID so it can be associated with that session.)
Given tens of millions of documents total, how would you write a query that totaled the elapsed viewing time of each movie, grouped by userType?
Desired query results:
premium users - total of "elapsed":
xmen: 500
starwars: 200
normal users - total of "elapsed":
xmen: 115
starwars: 25
If the data is not structured optimally for such a query, what would be the ideal structure?
Like this?
[
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 1,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 1,
"elapsed": 200
}
]
},
{
"eventType": "sessionInfo",
"userType": "premium",
"sessionGroupID": 2,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 2,
"elapsed": 500
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 3,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 3,
"elapsed": 10
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 4,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 4,
"elapsed": 100
}
]
},
{
"eventType": "sessionInfo",
"userType": "normal",
"sessionGroupID": 5,
"viewLog": [
{
"eventType": "mediaPlay",
"productSKU": "xmen",
"sessionGroupID": 5,
"elapsed": 5
},
{
"eventType": "mediaPlay",
"productSKU": "starwars",
"sessionGroupID": 5,
"elapsed": 25
}
]
}
]
Thanks for any and all guidance and advice!
Upvotes: 0
Views: 87
Reputation: 1891
The following query iterates over the collection and collect all session IDs grouped by the userTypes. Then it creates a subquery which iterates over the collection and collect all movies and the sum of the elapsed time where eventType
is "mediaPlay" and the collected sessions contains the sessionGroupID
.
The @@coll
is a bind parameter which includes your collection name.
FOR doc IN @@coll
FILTER doc.eventType == "sessionInfo"
COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
RETURN {
"userTypes" : userTypes,
"movies" : (
FOR event IN @@coll
FILTER event.sessionGroupID IN sessions
FILTER event.eventType == "mediaPlay"
COLLECT movie = event.productSKU INTO elapsed = event.elapsed
RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
)
}
The result of this query is:
[
{
"userTypes": "normal",
"movies": [
{
"movie": "starwars",
"elapsed": 25
},
{
"movie": "xmen",
"elapsed": 115
}
]
},
{
"userTypes": "premium",
"movies": [
{
"movie": "starwars",
"elapsed": 200
},
{
"movie": "xmen",
"elapsed": 500
}
]
}
]
Regarding your second question. Nested arrays/objects wouldn't optimise this query but you should split your data into two collections. One for every eventType
(e.g. name the collections like the eventType sessionInfo
and mediaPlay
). This reduces the number of needed filter statements and more important, it allows you to query separately over sessionInfos and mediaPlays which highly boost your performance.
The query would then look like:
FOR doc IN sessionInfo
COLLECT userTypes = doc.userType INTO sessions = doc.sessionGroupID
RETURN {
"userTypes" : userTypes,
"movies" : (
FOR event IN mediaPlay
FILTER event.sessionGroupID IN sessions
COLLECT movie = event.productSKU INTO elapsed = event.elapsed
RETURN { "movie" : movie, "elapsed" : SUM(elapsed) }
)
}
Upvotes: 2