dr11

Reputation: 5726

How Azure Search stores the index data

I have a lot of data stored in Azure Search. And since I'm greedy, I decided to find out how the index data is stored so that I can predict its size and the service costs.

Spoiler: According to the experiments, field name length does not impact the storage used for the index.

Input (a record example follows each structure)

  1. Data structure with Id + 9 string fields. All fields have long names. The ratio of name length to data length is 24 to 36.

    Record example:

    {
        "Id": "55bd7474-1e48-464c-a54d-bc2f3d8b0383",
        "MySuperLongNameProperty": "0e2c5f5e-9464-4030-bf3f-9de41181faff",
        "MySuperLongName2Property": "aa521300-1925-4dd6-97f2-f27fed1b720e",
        "MySuperLongName3Property": "9eec9f1f-d970-4581-8677-92cd735c9d80",
        "MySuperLongName4Property": "e3b4619b-bb8c-4fa2-82b2-55287f4262ae",
        "MySuperLongName5Property": "e6b79880-650d-4733-b91a-e5a4e066811d",
        "MySuperLongName6Property": "d391c66c-f3c6-45e2-96ef-80ab682fa07b",
        "MySuperLongName7Property": "62a92d68-74e6-41b1-8f92-ac3795b649cd",
        "MySuperLongName8Property": "83510497-a6b0-4d6e-9130-0f8deefd73db",
        "MySuperLongName9Property": "977e397e-5fc9-4677-afaf-52b9ea0a8f23"
    }
    
  2. Data structure with Id + 9 string fields. All fields have short names. The ratio of name length to data length is 3 to 36.

    Record example:

    {
        "Id": "f403f9ce-b343-4e38-bc4b-24d300eb13fb",
        "mp": "10970b17-62fe-431a-bf4f-d5a17266c4dc",
        "m2p": "b338290b-069b-4494-8c9e-8da85aad0990",
        "m3p": "1be76d7f-07d2-4648-9888-ed15ec7b3857",
        "m4p": "327206c8-561c-4651-95e0-06c58f83739a",
        "m5p": "241b2be7-9aac-41f9-b669-c5c768acd42e",
        "m6p": "55a1691a-d525-442e-b369-380d2480f2b1",
        "m7p": "a1263c81-022b-4f59-97fe-8916e1457d35",
        "m8p": "b4a4819b-185b-46ab-8e34-838fbc8a598a",
        "m9p": "38bc1df8-81cf-4005-bb14-2fe8a1c6797a"
    }
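
To put a number on how much raw text the field names add per record, here is a rough sketch. It assumes each record is serialized as compact JSON (no whitespace), which may not match what the service actually stores, and that every value is a 36-character Guid string. The helper name compactJsonLength is made up for this illustration.

```fsharp
// Field names copied from the two record examples above.
let longNames =
    [ yield "Id"
      yield "MySuperLongNameProperty"
      for i in 2 .. 9 do
          yield sprintf "MySuperLongName%dProperty" i ]

let shortNames =
    [ yield "Id"
      yield "mp"
      for i in 2 .. 9 do
          yield sprintf "m%dp" i ]

// Length of the default Guid string format (36 characters).
let guidLength = System.Guid.NewGuid().ToString().Length

// Hypothetical helper: character count of one record as compact JSON.
// Assumption: no whitespace and ASCII only, so characters equal UTF-8 bytes.
let compactJsonLength (names: string list) =
    // Each field is "name":"value", i.e. 4 quotes + 1 colon of overhead.
    let fields = names |> List.sumBy (fun name -> name.Length + guidLength + 5)
    // Commas between the fields plus the two enclosing braces.
    fields + (List.length names - 1) + 2

let longSize = compactJsonLength longNames
let shortSize = compactJsonLength shortNames

printfn "long: %d, short: %d, difference: %d characters per record"
    longSize shortSize (longSize - shortSize)
```

Under these assumptions the long-name record comes to 638 characters and the short-name record to 449, so the field names alone account for 189 extra characters per record.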
    

Experiments

For each experiment I used Guid data to populate all fields (.NET Guid.NewGuid().ToString()).

Each experiment was executed as N batches of 1000 items:

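open Microsoft.Azure.Search
open Microsoft.Azure.Search.Models

// Upload one batch of documents to the index.
let insert<'t> (client: ISearchIndexClient) (docs: 't list) =
    let actions = docs |> Seq.ofList |> Seq.map(fun x -> IndexAction.Upload x) |> Seq.cast<IndexAction<'t>>
    let batch = IndexBatch.New(actions)
    client.Documents.Index batch |> ignore

// N batches (1000 here) of 1000 generated records each.
// `client` is an ISearchIndexClient created elsewhere.
for _ in [1..1000] do
    let batch = [1..1000] |> List.map(fun i -> {.. generate record ..})
    insert client batch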

So, some numbers:

  1. Adding 1.2M records to the index

    Long name storage size: 1.68 GB

    Short name storage size: 1.65 GB

  2. Adding 3M records to the index

    Long name storage size: 5.53 GB (~2 GB of raw JSON text data)

    Short name storage size: 4.11 GB (~1.5 GB of raw JSON text data)

    After 10-20 minutes, the overall size suddenly dropped on its own:

    Long name storage size: 4.04 GB

    Short name storage size: 4.06 GB
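
To make these numbers easier to compare, they can be normalized to storage per document. This is only a sketch: it assumes the reported sizes are decimal gigabytes (10^9 bytes), which the portal may or may not use, and the Measurement record and bytesPerDocument helper are names invented for this illustration.

```fsharp
// Measurements copied from the list above.
type Measurement =
    { Label: string
      Documents: float
      LongGB: float
      ShortGB: float }

let measurements =
    [ { Label = "1.2M records"; Documents = 1.2e6; LongGB = 1.68; ShortGB = 1.65 }
      { Label = "3M records, before the reduction"; Documents = 3.0e6; LongGB = 5.53; ShortGB = 4.11 }
      { Label = "3M records, after the reduction"; Documents = 3.0e6; LongGB = 4.04; ShortGB = 4.06 } ]

// Assumption: 1 GB = 10^9 bytes.
let bytesPerGB = 1e9

// Hypothetical helper: average index storage used by one document.
let bytesPerDocument (sizeGB: float) (documents: float) =
    sizeGB * bytesPerGB / documents

for m in measurements do
    printfn "%s: long %.0f B/doc, short %.0f B/doc"
        m.Label
        (bytesPerDocument m.LongGB m.Documents)
        (bytesPerDocument m.ShortGB m.Documents)
```

With that assumption, every measurement except the long-name index before the reduction lands between roughly 1350 and 1400 bytes per document, while that one outlier is around 1840.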

Originally, I expected to see the behavior described here. But right after the 2nd experiment the size difference was significant (the index had not been compressed yet).

After all, I assume there are a few strategies for storing the index data. Maybe for small indexes the field names are compressed automatically, while for larger ones the data is stored as is and a background service is scheduled to compress it later.

As a result, as far as I can see, field naming makes no difference: the length of a field name does not impact the storage size.

Any thoughts or explanations?

Upvotes: 1

Views: 440

Answers (1)

ramero-MSFT

Reputation: 980

Indeed, the names you give your fields should generally have a negligible impact on the overall size of your index. Each of the document's fields exists on disk in multiple forms, depending on which features are enabled for that field (searchable, filterable, sortable, etc.). Most of those forms are heavily optimized for their specific purpose, and in most cases the field names don't need to be included in the files that contain them. However, the full original JSON documents are also stored alongside the indexed versions so that the documents can be retrieved. Since the "original" documents include the field names, there will technically be some linear correlation between the length of the field names and the overall size of your index, but the coefficient should be pretty small. The best way to find out what that coefficient is for your data is through tests (which you have already done), since each use case will be different.
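
As a rough illustration of what such a coefficient could look like for the numbers in the question, the sketch below divides the difference in index size by the difference in raw field-name text. It rests on two assumptions: the long-name records carry about 189 more characters of field names per document than the short-name ones (counted from the record examples in the question as compact ASCII JSON), and the sizes are decimal gigabytes. The coefficient function is a hypothetical helper, not something the service exposes.

```fsharp
// Assumption: extra field-name characters per document, long vs short names.
let extraNameBytesPerDocument = 189.0

// Assumption: 1 GB = 10^9 bytes.
let bytesPerGB = 1e9

// Hypothetical helper: bytes of extra index storage per byte of extra
// field-name text in the source documents.
let coefficient (longGB: float) (shortGB: float) (documents: float) =
    (longGB - shortGB) * bytesPerGB / (documents * extraNameBytesPerDocument)

// Sizes and document counts taken from the question.
printfn "1.2M documents: %.2f" (coefficient 1.68 1.65 1.2e6)
printfn "3M documents, before the reduction: %.2f" (coefficient 5.53 4.11 3.0e6)
printfn "3M documents, after the reduction: %.2f" (coefficient 4.04 4.06 3.0e6)
```

With these inputs the estimate is about 0.13 for the 1.2M run, about 2.5 for the 3M run before the size dropped, and slightly negative (about -0.04) afterwards. A negative value suggests that, once the index has settled, the effect is smaller than the measurement noise.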

Upvotes: 2
