Mike J
Mike J

Reputation: 159

MarkLogic search query multiple duplicates, grab latest results based off timestamp field

I've got a dataset that looks like this:

    {
     "id": 1234,
     "name": "name-1",
     "timestamp: "2022-02-10T13:48:32.195344Z",
    }

There are several thousand documents with unique id's in the database I'm trying to grab. However the issue is the "name" property can change with each id, so there can be multiple documents that have the same id but different name.

For my result set I'm trying to return a full list of all the documents of each unique id-name pair, but only one document for each pair that is the document with the latest timestamp.

For example with four documents with the id of "1234" and "name-2", I'd only want to return one of those documents that has the latest timestamp.

There are millions of documents in this dataset, so performance is a concern here as well.

Is there any way to build this query with cts functions or similar?

Upvotes: 1

Views: 65

Answers (1)

Mads Hansen
Mads Hansen

Reputation: 66781

If you are able to process all of the documents in one module execution, then you could do something like this, to use a map with a composite key using the id and name as the key, and then test whether there is a timestamp and if greater or less than the current docs timestamp:

let $map := map:new()
let $_ :=
  for $doc in cts:search(doc(), $query)
  let $id := $doc/id
  let $name := $doc/name
  let $timestamp := $doc/timestamp/xs:dateTime(.)
  let $key := $id || $name
  let $old_value := map:get($map, $key)/timestamp/xs:dateTime(.)
  where not(exists($old_value) and $old_value gt $timestamp)
  return map:put($map, $key, $doc)
return 
  for $key in map:keys($map)
  return map:get($map, $key)

However, that might be slow and runs the risk of blowing the Expanded Tree Cache. If you can index those fields, then you could use cts:value-tuples() with similar logic:

let $map := map:new()
let $_ :=
  for $tuple in cts:value-tuples(
    ( 
      cts:element-reference(xs:QName("id")),
      cts:element-reference(xs:QName("name")),
      cts:element-reference(xs:QName("timestamp")),
      cts:uri-reference()
    ),
    (), 
    $query
  )
  let $id := $tuple[1]
  let $name := $tuple[2]
  let $timestamp := $tuple[3]
  let $key := $id || $name
  let $old_value := map:get($map, $key)[3]
  where not(exists($old_value) and $old_value gt $timestamp)
  return map:put($map, $key, $tuple)
return 
  for $key in map:keys($map)
  return (: this returns the URI, you could also wrap with doc() to fetch them :)
    map:get($map, $key)[4]

Upvotes: 0

Related Questions