yifanwu
yifanwu

Reputation: 1645

self-join on documentdb syntax error

I'm having trouble doing an otherwise SQL valid self-join query on documentdb.

So the following query works: SELECT * FROM c AS c1 WHERE c1.obj="car"

But this simple self join query does not: SELECT c1.url FROM c AS c1 JOIN c AS c2 WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url, with the error, Identifier 'c' could not be resolved.

It seems that documendb supports self-joins within the document, but I'm asking on the collection level.

I looked at the official syntax doc and understand that the collection name is basically inferred; I tried changing c to explicitly my collection name and root but neither worked.

Am I missing something obvious? Thanks!

Upvotes: 3

Views: 4344

Answers (2)

Naveen
Naveen

Reputation: 27

Above Answer has queries mentioned by @Andrew Liu. This will resolve your error but Azure Cosmos DB does not support Cross-item and cross-container joins. Use this link to read about joins https://learn.microsoft.com/en-us/azure/cosmos-db/sql/sql-query-join

Upvotes: 0

Andrew Liu
Andrew Liu

Reputation: 8119

A few things to clarify:

1.) Regarding Identifier 'c' could not be resolved

Queries are scoped to a single collection; and in the example above, c is an implicit alias for the collection which is being re-aliased to c1 with the AS keyword.

You can fix the example query changing fixing the JOIN to reference c1:

SELECT c1.url
FROM c AS c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url`

This is also equivalent to:

SELECT c1.url
FROM c1
JOIN c1 AS c2
WHERE c1.obj="car" AND c2.obj="person" AND c1.url = c2.url`

2.) Understanding JOINs and examining your data model

With that said, I don't think fixing the query syntax issue above will produce the behavior you are expecting. The JOIN keyword in DocumentDB SQL is designed for forming a cross product with a denormalized array of elements within a document (as opposed to forming cross products across other documents in the same collection). If you run in to struggles here, it may be worth taking a step back and revisiting how to model your data for Azure Cosmos DB.

In a RDBMS, you are trained to think entity-first and normalize your data model based on entities. You rely heavily on a query engine to optimize queries to fit your workload (which typically do a good, but not always optimal, job for retrieving data). The challenges here are that many relational benefits are lost as scale increases, and scaling out to multiple shards/partitions becomes a requirement.

For a scale-out distributed database like Cosmos DB, you will want to start with understanding the workload first and optimize your data model to fit the workload (as opposed to thinking entity first). You'll want to keep in mind that collections are merely a logical abstraction composed of many replicas that live within partition sets. They do not enforce schema and are the boundary for queries.

When designing your model, you will want to incorporate the following questions in to your thought process:

  • What is the scale, in terms of size and throughput, for the broader solution (an estimate of order of magnitude is sufficient)?

  • What is the ratio of reads vs writes?

  • For writes - what is the pattern for writes? Is it mostly inserts, or are there a lot of updates?

  • For reads - what do top N queries look like?

The above should influence your choice of partition key as well as what your data / object model should look like. For example:

  • The ratio of requests will help guide how you make tradeoffs (use Pareto principle and optimize for the bulk of your workload).
  • For read-heavy workloads, commonly filtered properties become candidates for choice of partition key.
  • Properties that tend to be updated together frequently should be abstracted together in the data model, and away from properties that get updated with a slower cadence (to lower the RU charge for updates).
  • Don't be afraid to duplicate properties to enrich queryability, and annotate types, across different record types. For example, we have two types of documents: cat and person.

    {
       "id": "Andrew",
       "type": "Person",
       "familyId": "Liu",
       "employer": "Microsoft"
    }
     
    {
       "id": "Ralph",
       "type": "Cat",
       "familyId": "Liu",
       "fur": {
             "length": "short",
             "color": "brown"
       }
    }
     

We can query both types of documents without needing a JOIN simply by running a query without a filter on type:

SELECT * FROM c WHERE c.familyId = "Liu"

And if we wanted to filter on type = “Person”, we can simply add a filter on type to our query:

SELECT * FROM c WHERE c.familyId = "Liu" AND c.type = "Person"

Upvotes: 5

Related Questions