How to avoid posting duplicates into elasticsearch using Nest .NET 6.x?

Question

When data from a device is written to Elasticsearch, duplicates appear, and I would like to avoid them. I'm using an IElasticClient object from NEST in .NET to index the data.

I searched for a method like ElasticClient.SetDocumentId(), but can't find one.

_doc doc = (_doc)obj;
HashObject hashObject = new HashObject { DataRecordId = doc.DataRecordId, TimeStamp = doc.Timestamp };
// hashId should be the document ID.
int hashId = hashObject.GetHashCode();
ElasticClient.IndexDocumentAsync(doc);

I would like to update the existing document in Elasticsearch instead of adding another copy of the same object.

Russ Cam · Accepted Answer

Assuming the following setup

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var settings = new ConnectionSettings(pool)
    .DefaultIndex("example")
    .DefaultTypeName("_doc");

var client = new ElasticClient(settings);

public class HashObject
{
    public int DataRecordId { get; set; }
    public DateTime TimeStamp { get; set; }
}

If you want to set the Id for a document explicitly on the request, you can do so with

Fluent syntax

var indexResponse = client.Index(new HashObject(), i => i.Id("your_id"));

Object initializer syntax

var indexRequest = new IndexRequest<HashObject>(new HashObject(), id: "your_id");
var indexResponse = client.Index(indexRequest);

Both result in the request

PUT http://localhost:9200/example/_doc/your_id
{
  "dataRecordId": 0,
  "timeStamp": "0001-01-01T00:00:00"
}
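The question derives its id from GetHashCode(), but on a class that does not override it, GetHashCode() is not computed from the field values, so two objects holding the same data will generally get different ids. A minimal sketch of one alternative, assuming a duplicate is defined by DataRecordId plus TimeStamp: hash those two values into a stable string. DocumentIds and For are hypothetical names used for illustration and are not part of NEST.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public class HashObject
{
    public int DataRecordId { get; set; }
    public DateTime TimeStamp { get; set; }
}

// Hypothetical helper (not part of NEST): builds a deterministic document id.
public static class DocumentIds
{
    public static string For(HashObject o)
    {
        // Invariant culture and the round-trip "O" format keep the key identical
        // across machines. Note that "O" also encodes DateTime.Kind, so the same
        // instant stored as Utc and as Local produces different keys.
        string key = o.DataRecordId.ToString(CultureInfo.InvariantCulture)
                     + "|" + o.TimeStamp.ToString("O", CultureInfo.InvariantCulture);

        using (var sha = SHA256.Create())
        {
            byte[] bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(key));
            // Hex-encode the 32-byte digest into a 64-character lowercase string.
            return BitConverter.ToString(bytes).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

The result could then be passed in place of "your_id", for example client.Index(doc, i => i.Id(DocumentIds.For(doc))), so that indexing the same record again targets the same _id instead of creating a new document.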

As Rob pointed out in the question comments, NEST has a convention whereby it can infer the Id from the document itself, by looking for a property on the CLR POCO named Id. If it finds one, it will use that as the Id for the document. This does mean that an Id value ends up being stored in _source (and indexed, but you can disable this in the mappings), but it is useful because the Id value is automatically associated with the document and used when needed.

If HashObject is updated to have an Id value, now we can just do

Fluent syntax

var indexResponse = client.IndexDocument(new HashObject { Id = 1 });

Object initializer syntax

var indexRequest = new IndexRequest<HashObject>(new HashObject { Id = 1 });
var indexResponse = client.Index(indexRequest);

which will send the request

PUT http://localhost:9200/example/_doc/1
{
  "id": 1,
  "dataRecordId": 0,
  "timeStamp": "0001-01-01T00:00:00"
}
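The updated HashObject is not shown in the answer. A minimal sketch of what it might look like, assuming an int Id as the Id = 1 initializers above suggest:

```csharp
using System;

public class HashObject
{
    // A property named Id is what the inference convention described above looks for.
    // The int type is an assumption based on the "Id = 1" examples.
    public int Id { get; set; }
    public int DataRecordId { get; set; }
    public DateTime TimeStamp { get; set; }
}
```

With this shape, indexing two objects that carry the same Id addresses the same document, so the second request overwrites the first rather than adding a duplicate.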

If your documents do not have an id field in the _source, you'll need to handle the _id values from the hits metadata from each hit yourself. For example

var searchResponse = client.Search<HashObject>(s => s
    .MatchAll()
);

foreach (var hit in searchResponse.Hits)
{
    var id = hit.Id;
    var document = hit.Source;

    // do something with them
}
