Reputation: 988
I am connecting MongoDB to Spark and I want to load data using a query.
df = sqlContext.read.format("com.mongodb.spark.sql").options(collection='test', query={'name': 'jack'}).load()
df.show()
But it returns the whole collection. How can I reproduce the result of the query db.test.find({'name': 'jack'}) in Spark?
Upvotes: 2
Views: 1048
Reputation: 330413
You can use filter or where to specify the condition:
from pyspark.sql.functions import col

# Keep only the documents whose name field equals "jack"
df.filter(col("name") == "jack")
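For context, here is a minimal end-to-end sketch of the same filter. It assumes the mongo-spark-connector package is on the classpath; the URI, database, and collection names are placeholders to adjust for your deployment:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Hypothetical connection settings; replace with your own.
spark = SparkSession.builder \
    .appName("mongo-filter-example") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test") \
    .getOrCreate()

# Load the collection, then filter; the predicate is pushed down to MongoDB.
df = spark.read.format("com.mongodb.spark.sql").load()
df.filter(col("name") == "jack").show()

# df.where("name = 'jack'") is an equivalent way to express the same condition.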
The filter will be converted to an aggregation pipeline:
When using filters with DataFrames or Spark SQL, the underlying Mongo Connector code constructs an aggregation pipeline to filter the data in MongoDB before sending it to Spark.
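In other words, the filter above is pushed down as a single $match stage. A rough pymongo equivalent of what the connector runs against MongoDB (the client, database, and collection names here are assumptions) looks like this:

from pymongo import MongoClient

# Hypothetical connection details; adjust to your deployment.
client = MongoClient("mongodb://127.0.0.1:27017")
db = client["mydb"]

# Roughly the pipeline the connector builds for col("name") == "jack":
# a single $match stage, so only matching documents leave MongoDB.
for doc in db.test.aggregate([{"$match": {"name": "jack"}}]):
    print(doc)

You can also check the pushdown from the Spark side with df.filter(col("name") == "jack").explain(); the pushed predicate typically shows up in the scan node of the physical plan.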
Upvotes: 4