cosmicsage

Reputation: 155

MongoDB: Query only the last N documents (some portion) of a collection

In a collection that has, say, 100 documents, I want to run the following query:

collection.find({"$text": {"$search": "some_string"}})

Assume that a suitable text index already exists. My question is: how can I run this query on the last n documents only?

All the questions I found on the web ask how to get the last n docs, whereas my question is how to search within the last n docs only.

More generally, my question is: how can I run a Mongo query on some portion, say 20%, of a collection?

What I tried

I'm using pymongo, so I tried skip() and limit() to get the last n documents, but I couldn't find a way to run a query on the cursor that those functions return.

After @hhsarh's answer, here's what I tried, to no avail:

# here's what I tried after initial answers
recents = information_collection.aggregate([
                {"$match" : {"$text" : {"$search" : "healthline"}}},
                {"$sort" : {"_id" : -1}},
                {"$limit" : 1},
            ])

The result still comes from the whole collection instead of just the last record/document, as the code above intends.

The last document doesn't contain "healthline" in any field, so the intended result of the query is an empty list []. But I get a document back.

Can someone please explain how this is possible?

Upvotes: 2

Views: 565

Answers (2)

turivishal

Reputation: 36104

This is not possible with the $text operator, because of the following restriction:

The $match stage that includes a $text must be the first stage in the pipeline

This means we can't limit the documents before the $text stage. Read more about the $text operator restrictions.
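
That said, a $text condition can share a filter with conditions on other fields, so one possible workaround is to look up the _id of the n-th most recent document first and then add an _id range to the text query. The sketch below is only that, a sketch: it assumes _id values grow with insertion order (default ObjectIds do so only approximately), the helper names are hypothetical, and `collection` stands for any pymongo collection.

```python
def nth_last_id(collection, n):
    """Return the _id of the n-th most recent document, or None if there are fewer than n."""
    if n < 1:
        raise ValueError("n must be at least 1")
    cursor = collection.find({}, {"_id": 1}).sort("_id", -1).skip(n - 1).limit(1)
    docs = list(cursor)
    return docs[0]["_id"] if docs else None


def text_filter_for_last_n(search, cutoff_id=None):
    """Build a filter that keeps $text at the top level and adds an _id lower bound."""
    query = {"$text": {"$search": search}}
    if cutoff_id is not None:
        # Assumption: documents with _id >= cutoff_id are exactly the last n.
        query["_id"] = {"$gte": cutoff_id}
    return query


def search_last_n(collection, search, n):
    """Run the text search restricted to the last n documents (two round trips)."""
    cutoff = nth_last_id(collection, n)
    return list(collection.find(text_filter_for_last_n(search, cutoff)))
```

With this approach the text index is still used, at the cost of one extra query to find the cutoff. If the collection holds fewer than n documents, no cutoff is applied and the whole collection is searched.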


A second option: this may be possible if you use the $regex regular expression operator instead of the $text operator for searching.

If you need the search to behave like the $text operator, you have to transform your search input as below:

  • let's assume searchInput is your input variable
  • the list of fields to search is in searchFields
  • split the search input string on whitespace, loop over the resulting words, and convert each one to a regular expression
  • loop over the search fields in searchFields and prepare an $in condition for each

import re

searchInput = "This is search"
searchFields = ["field1", "field2"]

# one case-insensitive regex per word of the search input
searchRegex = [re.compile(s, re.IGNORECASE) for s in searchInput.split()]
# one {field: {"$in": [regexes]}} condition per search field
searchPayload = [{f: {"$in": searchRegex}} for f in searchFields]
print(searchPayload)

The resulting searchPayload looks like this (regexes shown in shell notation):

[
  {'field1': {'$in': [/This/i, /is/i, /search/i]}}, 
  {'field2': {'$in': [/This/i, /is/i, /search/i]}}
]

Use searchPayload with the $or operator in a $match stage placed last in the pipeline:

recents = information_collection.aggregate([
  # -1 = descending, so the $limit below keeps the last n documents (use 1 for the first n)
  { "$sort": { "_id": -1 } }, 
  # use any limit of number as per your requirement
  { "$limit": 10 }, 
  { "$match": { "$or": searchPayload } }
])

print(list(recents))
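
The steps above can be wrapped in one helper. This is a minimal sketch, not a drop-in replacement for $text: the function name `build_regex_pipeline` is hypothetical, and it additionally escapes each word with `re.escape` so that characters such as `+` or `.` in the user input are matched literally rather than interpreted as regex syntax.

```python
import re


def build_regex_pipeline(search_input, search_fields, n):
    """Build a pipeline that regex-searches only the last n documents (by _id)."""
    if not search_fields:
        # MongoDB rejects an empty $or array, so fail early.
        raise ValueError("search_fields must not be empty")
    # One case-insensitive, literal-match regex per whitespace-separated word.
    regexes = [re.compile(re.escape(word), re.IGNORECASE)
               for word in search_input.split()]
    # A field matches if any of the word regexes matches it.
    conditions = [{field: {"$in": regexes}} for field in search_fields]
    return [
        {"$sort": {"_id": -1}},           # newest first
        {"$limit": n},                    # keep only the last n documents
        {"$match": {"$or": conditions}},  # search within those n
    ]


pipeline = build_regex_pipeline("C++ tips", ["field1", "field2"], 10)
# With a real collection: list(information_collection.aggregate(pipeline))
```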

Note: $regex search can cause performance issues.

A compound index on your search fields may help search performance in some cases. With pymongo it can be created like this:

information_collection.create_index([("field1", 1), ("field2", 1)])

Upvotes: 2

hhharsha36

Reputation: 3349

What you are looking for can be achieved with the MongoDB aggregation framework.

Note: As pointed out by @turivishal, $text won't work if it is not in the first stage of the aggregation pipeline.

collection.aggregate([
  {
    "$sort": {
      "_id": -1
    }
  },
  {
    "$limit": 10  // `n` value, where n is the number of last records you want to consider
  },
  {
    "$match" : {
      // All your find query goes here
    }
  },
], { allowDiskUse: true })  // just in case the computation exceeds 100 MB

Since _id is indexed by default, the $sort stage in the above aggregation should be fast. However, performance degrades as the value of n grows.

Note: If you are using pymongo, replace the last line of the code example with the line below:

], allowDiskUse=True)
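
For the more general "some portion, say 20%" case from the question, n can be derived from the collection size. The following is a rough sketch under that assumption; the function names are hypothetical, and the total would come from something like `collection.count_documents({})` in pymongo.

```python
def last_n_for_percent(total_docs, percent):
    """Number of documents in the most recent `percent` of a collection, rounded up."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding surprises.
    n = (total_docs * percent + 99) // 100
    # $limit requires a positive value, so never return 0.
    return max(1, n)


def last_percent_pipeline(total_docs, percent, match):
    """Pipeline that applies `match` only to the newest `percent` of documents."""
    return [
        {"$sort": {"_id": -1}},
        {"$limit": last_n_for_percent(total_docs, percent)},
        {"$match": match},
    ]


# Usage with a real collection (not executed here):
#   total = collection.count_documents({})
#   pipeline = last_percent_pipeline(total, 20, {"field1": "some_value"})
#   results = list(collection.aggregate(pipeline, allowDiskUse=True))
```

The count and the aggregation are two separate operations, so on a collection that is being written to concurrently the portion is only approximate.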

Upvotes: 2
