chhantyal

Reputation: 12252

Azure DocumentDB with MongoDB Protocol Spark integration

I want to use DocumentDB, but there is no connector for PySpark. It looks like DocumentDB also supports the MongoDB protocol, as mentioned here, which means all existing MongoDB drivers should work. Since there is a PySpark connector for MongoDB, I wanted to try this out.

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()

This throws an error:

com.mongodb.MongoCommandException: Command failed with error 115: ''$sample' is not supported' on server example.documents.azure.com:10250. The full response is { "_t" : "OKMongoResponse", "ok" : 0, "code" : 115, "errmsg" : "'$sample' is not supported", "$err" : "'$sample' is not supported" }

It looks like the DocumentDB MongoDB API doesn't support all MongoDB features, but I can't find any documentation about this. Or am I missing something else?

Upvotes: 1

Views: 311

Answers (1)

Stennie

Reputation: 65313

I want to use DocumentDB but there is no connector for PySpark.

A preview of a Spark to DocumentDB connector (including a pyDocumentDB package) was made available in early April 2017.

Looks like DocumentDB also supports MongoDB Protocol as mentioned here, which means all existing MongoDB drivers should work

DocumentDB supports the MongoDB wire protocol for communication and reports its version as MongoDB 3.2.0, but this does not mean that it is a drop-in replacement with full support for all MongoDB features (or that DocumentDB implements features with identical behaviour and limits). A notable absence at the moment is any support for MongoDB's aggregation pipeline. That includes the $sample stage, which the PySpark connector expects to be available when it connects to a server claiming to be MongoDB 3.2.
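If you want to check for yourself whether a given endpoint accepts a particular aggregation stage before pointing Spark at it, something along these lines may help. This is only a minimal sketch: the helper name `probe_aggregation_stage` is made up for illustration, and it assumes you pass in an object with an `aggregate(pipeline)` method, such as a pymongo `Collection`. It catches `Exception` broadly to stay dependency-free; in real code you would probably catch `pymongo.errors.OperationFailure` instead, so that network errors are not mistaken for an unsupported stage.

```python
def probe_aggregation_stage(collection, stage):
    """Return (supported, detail) for a single aggregation stage.

    `collection` is any object with an `aggregate(pipeline)` method that
    returns an iterable, for example a pymongo Collection (assumption).
    `probe_aggregation_stage` is a hypothetical helper, not a library API.
    """
    try:
        # Consume the result so the command is actually sent to the server.
        list(collection.aggregate([stage]))
        return True, None
    except Exception as exc:
        # pymongo's OperationFailure exposes the server error code as `.code`;
        # other exception types simply yield None here.
        return False, (getattr(exc, "code", None), str(exc))
```

With pymongo you would call it as `probe_aggregation_stage(collection, {"$sample": {"size": 1}})`. Against the server in the question, I would expect the result to be `False` with code 115, matching the error shown above, but I have not run it against DocumentDB.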

You can find more examples of potential compatibility issues in the comments on the DocumentDB API for MongoDB documentation you referenced in your question.

Upvotes: 3
