Frank Mehlhop

Reputation: 2222

How to avoid posting duplicates into Elasticsearch using NEST .NET 6.x?

When data from a device is indexed into Elasticsearch, duplicates appear. I would like to avoid these duplicates. I'm using an IElasticClient instance from NEST (.NET) to index the data.

I searched for a method like ElasticClient.SetDocumentId(), but couldn't find one.

_doc doc = (_doc)obj;
HashObject hashObject = new HashObject { DataRecordId = doc.DataRecordId, TimeStamp = doc.Timestamp };
// hashId should be the document ID.
int hashId = hashObject.GetHashCode();
ElasticClient.IndexDocumentAsync(doc);

I would like to update the existing document in Elasticsearch instead of adding another copy of the same object.

Upvotes: 1

Views: 1048

Answers (2)

Frank Mehlhop

Reputation: 2222

Thank you very much, Russ, for this detailed and easy-to-understand description! :-)

The HashObject was only meant as a helper to get a unique ID from my real _doc object. I have now added an Id property to my _doc class; the rest is shown in the code below. I no longer get duplicates in Elasticsearch.

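public void Create(object obj)
{
    _doc doc = (_doc)obj;
    // Build a key from the fields that identify a record, then hash it.
    string idAsString = doc.DataRecordId.ToString() + doc.Timestamp.ToString();
    int hashId = idAsString.GetHashCode();
    // NEST infers the document ID from the Id property.
    doc.Id = hashId;
    // The returned task is not awaited here.
    ElasticClient.IndexDocumentAsync(doc);
}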
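
One caveat worth noting: string.GetHashCode() is not guaranteed to return the same value across processes or .NET versions, so the same record could end up with a different ID after a restart. Concatenating the two values without a delimiter can also make different records produce the same key. A minimal sketch of a deterministic alternative follows, assuming the document ID may be a string, and that DataRecordId is an int and the timestamp a DateTime as in the HashObject class in Russ's answer. DocumentIds and StableId are hypothetical names, not part of NEST.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class DocumentIds
{
    // Hypothetical helper: derives a stable ID from the fields that identify a record.
    public static string StableId(int dataRecordId, DateTime timestamp)
    {
        // The delimiter and the round-trip ("O") format keep the key
        // unambiguous and independent of the current culture.
        string key = dataRecordId.ToString(CultureInfo.InvariantCulture)
                     + "|" + timestamp.ToString("O", CultureInfo.InvariantCulture);

        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(key));
            // Hex-encode the hash so it can be used as a plain string ID.
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

The resulting string could then be passed as the explicit document ID, for example with i => i.Id(...) as shown in Russ's answer.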

Upvotes: 0

Russ Cam

Reputation: 125528

Assuming the following setup

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var settings = new ConnectionSettings(pool)
    .DefaultIndex("example")
    .DefaultTypeName("_doc");

var client = new ElasticClient(settings);

public class HashObject
{
    public int DataRecordId { get; set; }
    public DateTime TimeStamp { get; set; }
}

If you want to set the Id for a document explicitly on the request, you can do so with

Fluent syntax

var indexResponse = client.Index(new HashObject(), i => i.Id("your_id"));

Object initializer syntax

var indexRequest = new IndexRequest<HashObject>(new HashObject(), id: "your_id");   
var indexResponse = client.Index(indexRequest);

Both result in the request

PUT http://localhost:9200/example/_doc/your_id
{
  "dataRecordId": 0,
  "timeStamp": "0001-01-01T00:00:00"
}

As Rob pointed out in the question comments, NEST has a convention whereby it can infer the Id from the document itself, by looking for a property on the CLR POCO named Id. If it finds one, it will use that as the Id for the document. This does mean that an Id value ends up being stored in _source (and indexed, but you can disable this in the mappings), but it is useful because the Id value is automatically associated with the document and used when needed.
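
For illustration, the HashObject class from the setup with an Id property added might look like the sketch below. The int type is an assumption; the original class definition does not include this property.

```csharp
using System;

public class HashObject
{
    // A property named Id is what the inference convention described above looks for.
    // The int type is an assumption made for this sketch.
    public int Id { get; set; }
    public int DataRecordId { get; set; }
    public DateTime TimeStamp { get; set; }
}
```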

If HashObject is updated to have an Id property, we can now just do

Fluent syntax

var indexResponse = client.IndexDocument(new HashObject { Id = 1 });

Object initializer syntax

var indexRequest = new IndexRequest<HashObject>(new HashObject { Id = 1 });
var indexResponse = client.Index(indexRequest);

which will send the request

PUT http://localhost:9200/example/_doc/1
{
  "id": 1,
  "dataRecordId": 0,
  "timeStamp": "0001-01-01T00:00:00"
}

If your documents do not have an id field in the _source, you'll need to handle the _id values from the hits metadata from each hit yourself. For example

var searchResponse = client.Search<HashObject>(s => s
    .MatchAll()
);

foreach (var hit in searchResponse.Hits)
{
    var id = hit.Id;
    var document = hit.Source;

    // do something with them
}

Upvotes: 1
