raul

Reputation: 121

Set custom index in mongodb with pymongo or _id

I need a custom index on my collection, which I created with this function:

def insert_post_mongo(df):
    if db.rss_crawler.estimated_document_count() == 0:
        db.rss_crawler.create_index([("url_hashed", pymongo.HASHED)])
    db.rss_crawler.insert_many(df.to_dict('records'))

My index comes from a url that I transform using the hashlib library:

posts_df['url_hashed'] = [hashlib.md5(x.encode()).hexdigest() for x in posts_df['link']]

However, I'm not sure this is the right way to do it. My original idea was to create an ObjectId from the URL, but I haven't been able to figure out how: ObjectId requires a 12-byte input or a 24-character hex string, and an MD5 hex digest is 32 characters. I'm also not sure whether that is even necessary, or whether a secondary index is enough.

Any ideas? Many thanks!

Raul.

Upvotes: 1

Views: 627

Answers (2)

Buzz Moschetti

Reputation: 7578

I am pretty sure what you want is something like this to end up in the doc:

{
  _id: ObjectId("5d8fcf7632c55e3d729b5541"), // primary key; not really important for this exercise
  hashedURL: "b9056d71aca02a3a7fb860f66864fef0"  // MD5 hash of URL
}

and you wish to do fast lookups on this. Create the index thusly:

db.rss_crawler.create_index( [("hashedURL", pymongo.ASCENDING) ] )

Now the lookup will use the index when you do this:

h2 = hashlib.md5(targetURL.encode()).hexdigest()
for d in db.rss_crawler.find({"hashedURL":h2}):
    print(d)

Upvotes: 1

Belly Buster

Reputation: 8814

You're overthinking it. Just set the _id to whatever you choose and that will work. It doesn't need to be an ObjectId; that's just the default if it's not set.
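
As a minimal sketch of that idea (the helper names `url_to_id` and `with_url_ids` and the sample rows are made up for illustration, and the `link` field is taken from the question), the MD5 hex string can be used as `_id` directly, without converting it to an ObjectId:

```python
import hashlib


def url_to_id(url):
    """Return the MD5 hex digest of a URL (32 hex characters)."""
    return hashlib.md5(url.encode()).hexdigest()


def with_url_ids(records, url_field="link"):
    """Return copies of the records with _id set to the hash of their URL."""
    return [{**rec, "_id": url_to_id(rec[url_field])} for rec in records]


# Hypothetical sample rows, standing in for posts_df.to_dict('records').
records = [
    {"link": "https://example.com/a", "title": "A"},
    {"link": "https://example.com/b", "title": "B"},
]
docs = with_url_ids(records)

# The digest is 32 hex characters (16 bytes), which is why it does not fit
# an ObjectId (24 hex characters / 12 bytes); as a plain string _id it is fine.
print(len(docs[0]["_id"]))  # 32
```

The resulting list can then be passed to db.rss_crawler.insert_many(docs). Because _id always carries a unique index, a separate index on url_hashed should not be needed in this setup, and inserting the same URL twice is rejected with a duplicate-key error.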

Upvotes: 1
