Reputation: 3711
From the following tutorial provided by Mongo:
MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))
am I correct in understanding that what is essentially happening is that Mongo is first dropping the collection, and then it's overwriting that collection with the new data?
My question then is: is it possible to use the MongoSpark connector to actually update records in Mongo? Let's say I've got data that looks like
{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}
What I would then like to do is to merge the record of this person with details from another file, i.e. that file looks like
{"name" : "John", "address" : "1800 some street"}
the goal is to update the record in Mongo so now the JSON looks like
{"_id" : ObjectId(12345), "name" : "John", "address" : "1800 some street", "Occupation" : "Baker"}
Now here's the thing: let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.
Upvotes: 4
Views: 8325
Reputation: 18835
There are a few questions here, I'll try to break them down.
What is essentially happening is that Mongo is first dropping the collection, and then it's overwriting that collection with the new data?
Correct. As of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into the collection. See the source snippet for more information.
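To make the drop-then-save semantics concrete, here is a minimal sketch in plain Scala. A mutable Map stands in for the MongoDB collection — this is not the real connector, just an illustration of what overwrite mode amounts to.

```scala
import scala.collection.mutable

// Existing documents in the "hundredClub" collection, keyed by _id.
val hundredClub = mutable.Map(
  "id-1" -> Map("name" -> "Alice"),
  "id-2" -> Map("name" -> "Bob")
)

// The new result set produced by the Spark job.
val newResult = Map("id-3" -> Map("name" -> "Carol"))

// mode "overwrite": drop the collection first...
hundredClub.clear()
// ...then save the new result into it.
hundredClub ++= newResult

// Only the new data survives; the original documents are gone.
```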
My question then is: is it possible to use the MongoSpark connector to actually update records in Mongo?
The patch described in SPARK-66 (mongo-spark v1.1+) means that if a dataframe contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and new documents whose _id value does not already exist in the collection will be inserted.
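The upsert behaviour can be sketched the same way, with a mutable Map keyed by _id standing in for the collection (illustrative only, not connector code):

```scala
import scala.collection.mutable

// The collection before the save, keyed by _id.
val people = mutable.Map(
  "id-12345" -> Map("name" -> "John", "Occupation" -> "Baker")
)

// Rows being saved: the first carries an _id that already exists in the
// collection, the second carries a new _id.
val rows = Seq(
  "id-12345" -> Map("name" -> "John", "Occupation" -> "Chef"),
  "id-67890" -> Map("name" -> "Jane", "Occupation" -> "Smith")
)

// Upsert: documents with a matching _id are updated, the rest inserted.
rows.foreach { case (id, doc) => people(id) = doc }
```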
What I would then like to do is to merge the record of the person from another file that has more details
As mentioned above, you need to know the _id value from your collection. Example steps:

1. Query the Person collection to retrieve John's _id value, i.e. ObjectId(12345).
2. Merge the _id value of ObjectId(12345) into your dataframe (B - from the other file with more information). Utilise a unique field value to join the two dataframes (A and B).
3. Save using overwrite mode.

we just want to update John, and that there are millions of other records that we would like to leave as is.
In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append.
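The filter-then-merge steps can be sketched in plain Scala. Ordinary collections stand in for the Spark dataframes here; the field names follow the example documents, and the join/filter wiring is an assumption about how you would put the pieces together:

```scala
// Dataframe A: documents read back from the Person collection.
val dfA = Seq(Map("_id" -> "id-12345", "name" -> "John", "Occupation" -> "Baker"))

// Dataframe B: records from the other file with more details. Before
// merging, filter out anything we don't want to touch, so the millions
// of other documents are never rewritten.
val dfB = Seq(
  Map("name" -> "John", "address" -> "1800 some street"),
  Map("name" -> "Jane", "address" -> "42 other road")
).filter(_("name") == "John")

// Join A and B on the shared "name" field; each merged row gains the
// _id (and other fields) from its matching A row.
val merged = for {
  a <- dfA
  b <- dfB
  if a("name") == b("name")
} yield a ++ b

// merged now carries _id, name, address and Occupation; saving it with
// mode append upserts the matching _id and leaves the rest of the
// collection as is.
```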
Upvotes: 5