TheM00s3

Reputation: 3711

Updating mongoData with MongoSpark

From the following tutorial provided by Mongo:

MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))

Am I correct in understanding that Mongo first drops the collection and then overwrites it with the new data?

My question, then, is whether it is possible to use the MongoSpark connector to actually update records in Mongo.

Let's say I've got data that looks like

{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}

What I would then like to do is merge in the record of the same person from another file that has more details, i.e. that file looks like

{"name" : "John", "address" : "1800 some street"}

The goal is to update the record in Mongo so that the JSON now looks like

{"_id" : ObjectId(12345), "name" : "John", "address" : "1800 some street", "Occupation" : "Baker"}

Now here's the thing: let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.

Upvotes: 4

Views: 8325

Answers (1)

Wan B.

Reputation: 18835

There are a few questions here, I'll try to break them down.

What is essentially happening is that Mongo is first dropping the collection, and then it's overwriting that collection with the new data?

Correct. As of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into it. See source snippet for more information.

My question is then is it possible to use the MongoSpark connector to actually update records in Mongo,

With the patch described in SPARK-66 (mongo-spark v1.1+), if a dataframe contains an _id field, the data will be upserted. This means that any existing documents with the same _id value will be updated, and documents whose _id value does not yet exist in the collection will be inserted.

What I would then like to do is to merge the record of the person from another file that has more details

As mentioned above, you need to know the _id value from your collection. Example steps:

  1. Create a dataframe (A) by reading from your Person collection to retrieve John's _id value. i.e. ObjectId(12345).
  2. Merge the _id value ObjectId(12345) into your dataframe (B, from the other file with more information). Use a unique field value to join the two dataframes (A and B).
  3. Save the merged dataframe (C), without specifying overwrite mode.
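
The steps above could be sketched roughly like this in Scala. Treat it as a minimal sketch rather than a tested recipe: the connection URI, the test.people namespace and the details.json path are placeholders I made up, and it assumes that name uniquely identifies a person in both sources.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

object MergeDetails {
  def main(args: Array[String]): Unit = {
    // Placeholder URI and namespace (assumptions, not taken from the question).
    val uri = "mongodb://127.0.0.1/test.people"

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-details")
      .config("spark.mongodb.input.uri", uri)
      .config("spark.mongodb.output.uri", uri)
      .getOrCreate()

    // Step 1: dataframe A, read from the collection, so every row carries its _id.
    val a = MongoSpark.load(spark)

    // Dataframe B: the other file with more details (hypothetical path).
    val b = spark.read.json("details.json")

    // Step 2: join on the unique field. The inner join keeps only people
    // present in both A and B, and each joined row has A's _id plus B's columns.
    val c = a.join(b, Seq("name"))

    // Step 3: save without overwrite mode. Because C contains _id,
    // the matching documents are upserted instead of the collection being dropped.
    MongoSpark.save(c.write.mode("append"))

    spark.stop()
  }
}
```

Joining against A (rather than attaching only the _id to B) keeps the existing fields such as Occupation in dataframe C. That matters if your connector version replaces the whole document on upsert, which I would assume unless you have checked otherwise.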

we just want to update John, and that there are millions of other records that we would like to leave as is.

In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append.
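
A rough sketch of that filter-then-append flow, assuming the same two dataframes as in the steps above. The updateOnly helper and its parameters are hypothetical names for illustration, not part of the connector.

```scala
import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object UpdateSelected {
  // people:  dataframe A, read from MongoDB (contains _id).
  // details: dataframe B, from the other file with more details.
  // names:   the people to update (hypothetical parameter).
  def updateOnly(people: DataFrame, details: DataFrame, names: Seq[String]): Unit = {
    // Filter B first, so only the wanted records take part in the join.
    val wanted = details.filter(col("name").isin(names: _*))

    // Inner join: rows of A without a match in the filtered B are dropped here,
    // so they are never written back and stay as they are in MongoDB.
    val merged = people.join(wanted, Seq("name"))

    // Mode append plus an _id column: matching documents are upserted
    // and the collection is not dropped.
    MongoSpark.save(merged.write.mode("append"))
  }
}
```

For the example in the question this would be called as UpdateSelected.updateOnly(a, b, Seq("John")), with the output URI taken from the Spark configuration.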

Upvotes: 5
