Reputation: 121
I need to specify a custom index for my collection which I did with this function:
def insert_post_mongo(df):
    if db.rss_crawler.estimated_document_count() == 0:
        db.rss_crawler.create_index([("url_hashed", pymongo.HASHED)])
    db.rss_crawler.insert_many(df.to_dict('records'))
My index comes from a url that I transform using the hashlib library:
posts_df['url_hashed'] = [hashlib.md5(x.encode()).hexdigest() for x in posts_df['link']]
However, I'm not sure this is the right approach. My original idea was to create an ObjectId from that URL, but I haven't been able to figure out how: ObjectId requires a 12-byte input or a 24-character hex string, and I haven't found a way to produce one. Even so, I'm not sure whether that's necessary or whether a secondary index is enough.
Any ideas? Many thanks!
Raul.
Upvotes: 1
Views: 627
Reputation: 7578
I am pretty sure what you want is something like this to end up in the doc:
{
    _id: ObjectId("5d8fcf7632c55e3d729b5541"),  // primary key; not really important for this exercise
    hashedURL: "b9056d71aca02a3a7fb860f66864fef0"  // MD5 hash of URL
}
and you wish to do fast lookups on this. Create the index thusly:
db.rss_crawler.create_index( [("hashedURL", pymongo.ASCENDING) ] )
Now you will get index-optimized performance when you do this:
h2 = hashlib.md5(targetURL.encode()).hexdigest()
for d in db.rss_crawler.find({"hashedURL": h2}):
    print(d)
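The hash has to be computed the same way on insert and on lookup, so it may help to share one helper. A minimal sketch, assuming `collection` is any object exposing PyMongo's `find` method; `hash_url` and `find_by_url` are hypothetical names, not part of PyMongo:

```python
import hashlib


def hash_url(url):
    """Return the MD5 hex digest of a URL (32 hex characters)."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def find_by_url(collection, url):
    """Return all documents whose hashedURL equals the hash of `url`."""
    # An equality match on hashedURL is the kind of query that can use
    # an ascending index on that field.
    return list(collection.find({"hashedURL": hash_url(url)}))
```

With a real collection this would be called as find_by_url(db.rss_crawler, targetURL).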
Upvotes: 1
Reputation: 8814
You're overthinking it. Just set the _id to whatever you choose and that will work. It doesn't need to be an ObjectId; that's just the default if it's not set.
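For example, the MD5 hex digest from the question can be stored directly as _id. This is only a sketch, assuming each record has a 'link' field as in the question and that `collection` exposes PyMongo's `insert_many`; `url_to_id` and `insert_posts` are hypothetical helper names:

```python
import hashlib


def url_to_id(url):
    """Return the MD5 hex digest of a URL, used here as the document _id."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def insert_posts(collection, records):
    """Insert copies of `records` with _id derived from each record's 'link'."""
    docs = [{**rec, "_id": url_to_id(rec["link"])} for rec in records]
    # MongoDB always keeps a unique index on _id, so no separate
    # create_index call is needed for lookups by this key.
    # Inserting the same URL twice fails with a duplicate-key error.
    if docs:  # insert_many rejects an empty list
        collection.insert_many(docs)
    return docs
```

This could be called as insert_posts(db.rss_crawler, posts_df.to_dict('records')), and a post fetched later with db.rss_crawler.find_one({"_id": url_to_id(url)}).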
Upvotes: 1