Reputation: 3427
Our SQL Server database table consists of approximately 2 million records. The Azure Search index that imports from it is now up to 9.5 million documents. Our indexer is set to run every 24 hours, and we have Change Tracking turned ON in our database. We need to know how to properly configure the indexer to only import new records every 24 hours.
I am wondering if the issue comes from how our database writes its records. Every 24 hours, all 2 million records are gathered from the web and written to a staging table. A stored procedure then compares the data in the staging table to the master data table. If the records for a certain Source ID are matched between the staging table and the master table, the staging table records replace those records in the master table, even when the records are identical. Records in the master table with a Source ID that doesn't exist in the staging table are left alone. So even though most of the records written to the master data table merely replace existing identical data, does the Azure Search indexer see those as new records?
At any given time, if our database only has 2 million records, then the Azure Search index should only have 2 million documents. But it continues to add documents, seemingly because of the rewrite process described above, and this is what we need to know how to fix.
EDIT: I noted an option for a "Soft Delete Policy." Is this something that would help us, as it would also delete documents in the index which correspond with records in the database that have been replaced?
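For context, a soft delete policy is configured on the data source definition, not the indexer. A sketch of what that might look like (the `IsDeleted` column name and marker value here are hypothetical; the actual column and value depend on your schema):

```json
{
  "name": "my-sql-datasource",
  "type": "azuresql",
  "credentials": { "connectionString": "..." },
  "container": { "name": "MasterTable" },
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
    "softDeleteColumnName": "IsDeleted",
    "softDeleteMarkerValue": "true"
  }
}
```

Note that this policy only removes index documents whose rows are *marked* deleted in the source; rows that are physically deleted or replaced under a new ID are not detected.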
EDIT #2: I've uploaded an Excel spreadsheet here which diagrams the flow I'm trying to explain, in case it helps to visualize it: https://dl.dropboxusercontent.com/u/8477791/v4AzureSearchIndexFlow.xlsx
Upvotes: 0
Views: 724
Reputation: 8634
The fact that new documents continue to accumulate has nothing to do with change detection or deletion detection. The issue is that your source rows get new IDs when they change.
Azure Search bases document identity solely on the value of the key field of your index. Change tracking determines what is read from the data source; it has no bearing on how that data is indexed. The only factor there is the ID of each document. Documents with matching IDs will be updated, while IDs that aren't in the index will result in new documents being uploaded.
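The upsert behavior described above can be sketched with a plain dictionary keyed on the document key field (illustrative only, not the Azure Search API):

```python
# Simulate an Azure Search index as a dict keyed on the key field ("id").
index = {}

def upsert(documents):
    """Merge a batch into the index: matching keys update the existing
    document, unseen keys create new documents (like 'mergeOrUpload')."""
    for doc in documents:
        index[doc["id"]] = doc

# First nightly run: two source rows.
upsert([{"id": "1", "name": "a"}, {"id": "2", "name": "b"}])

# Next run the source regenerates the same logical records, but the
# rewrite assigns NEW IDs, so the index grows instead of updating.
upsert([{"id": "3", "name": "a"}, {"id": "4", "name": "b"}])

print(len(index))  # 4 documents for 2 logical records
```

This is why the index climbs toward 9.5 million documents from a 2 million row table: each nightly rewrite produces rows whose key values the index has never seen.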
One possible workaround is to use a different column of your SQL table as the document key, but this only works if that column's values are unique per record and remain stable across your nightly staging-table rewrite.
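For reference, the document key is whichever index field is marked with `"key": true` in the index definition, so switching keys means pointing that flag at a stable column (field names here are hypothetical examples):

```json
{
  "name": "my-index",
  "fields": [
    { "name": "stableRecordId", "type": "Edm.String", "key": true },
    { "name": "sourceId", "type": "Edm.String" }
  ]
}
```

Because the key cannot be changed on an existing index, this typically means creating a new index with the desired key field and re-running the indexer against it.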
Upvotes: 0