How Azure Search stores the index data

Question

I have a lot of data stored in Azure Search, and I'm greedy enough that I decided to find out how the index data is stored, so that I can predict its size and the service costs.

Spoiler: according to my experiments, field name length does not affect the storage used by the index.

Input (examples at the end)

Data structure with Id + 9 string fields. All fields have long names. The ratio of name length to data length is about 24 to 36 characters (each value is a 36-character GUID string).

Record example:

{
    "Id": "55bd7474-1e48-464c-a54d-bc2f3d8b0383",
    "MySuperLongNameProperty": "0e2c5f5e-9464-4030-bf3f-9de41181faff",
    "MySuperLongName2Property": "aa521300-1925-4dd6-97f2-f27fed1b720e",
    "MySuperLongName3Property": "9eec9f1f-d970-4581-8677-92cd735c9d80",
    "MySuperLongName4Property": "e3b4619b-bb8c-4fa2-82b2-55287f4262ae",
    "MySuperLongName5Property": "e6b79880-650d-4733-b91a-e5a4e066811d",
    "MySuperLongName6Property": "d391c66c-f3c6-45e2-96ef-80ab682fa07b",
    "MySuperLongName7Property": "62a92d68-74e6-41b1-8f92-ac3795b649cd",
    "MySuperLongName8Property": "83510497-a6b0-4d6e-9130-0f8deefd73db",
    "MySuperLongName9Property": "977e397e-5fc9-4677-afaf-52b9ea0a8f23"
}

Data structure with Id + 9 string fields. All fields have short names. The ratio of name length to data length is about 3 to 36 characters (each value is a 36-character GUID string).

Record example:

{
    "Id": "f403f9ce-b343-4e38-bc4b-24d300eb13fb",
    "mp": "10970b17-62fe-431a-bf4f-d5a17266c4dc",
    "m2p": "b338290b-069b-4494-8c9e-8da85aad0990",
    "m3p": "1be76d7f-07d2-4648-9888-ed15ec7b3857",
    "m4p": "327206c8-561c-4651-95e0-06c58f83739a",
    "m5p": "241b2be7-9aac-41f9-b669-c5c768acd42e",
    "m6p": "55a1691a-d525-442e-b369-380d2480f2b1",
    "m7p": "a1263c81-022b-4f59-97fe-8916e1457d35",
    "m8p": "b4a4819b-185b-46ab-8e34-838fbc8a598a",
    "m9p": "38bc1df8-81cf-4005-bb14-2fe8a1c6797a"
}

Experiments

For each experiment I used Guid data to populate all fields (.NET Guid.NewGuid().ToString()).

Each experiment was executed as N batches of 1000 items:

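open Microsoft.Azure.Search
open Microsoft.Azure.Search.Models

// Upload one batch of documents to the index.
let insert<'t> (client: ISearchIndexClient) (docs: 't list) =
    let actions = docs |> Seq.map (fun x -> IndexAction.Upload x)
    let batch = IndexBatch.New(actions)
    client.Documents.Index batch |> ignore

// N batches of 1000 generated records each; `client` is the index client created elsewhere.
for _ in 1..1000 do
    let batch = [1..1000] |> List.map (fun i -> {.. generate record ..})
    insert client batch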

So, some numbers:

Adding 1.2M records to the index

Long name storage size: 1.68 GB

Short name storage size: 1.65 GB

Adding 3M records to the index

Long name storage size: 5.53 GB (~2 GB raw JSON text data)

Short name storage size: 4.11 GB (~1.5 GB raw JSON text data)

After 10-20 minutes, the overall size suddenly dropped on its own

Long name storage size: 4.04 GB

Short name storage size: 4.06 GB
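
As a rough cross-check of the raw JSON figures for the 3M-record run, here is a minimal F# sketch that estimates the size of one record in each schema. It assumes compact JSON (no whitespace), ASCII field names, and GUID values in the default format; the helper names `recordBytes` and `totalGb` are made up for this illustration and are not part of any SDK.

```fsharp
open System

// Length of a GUID string in the default "D" format (32 hex digits + 4 hyphens).
let guidLength = Guid.NewGuid().ToString().Length

// Hypothetical helper: bytes in one compact JSON record with the given field names.
let recordBytes (fieldNames: string list) =
    // each field is "name":"value" -> name + value + 4 quotes + 1 colon
    let fields = fieldNames |> List.sumBy (fun n -> n.Length + guidLength + 5)
    // commas between fields plus the opening and closing braces
    fields + (List.length fieldNames - 1) + 2

// Field names of the two schemas used in the experiments.
let longNames =
    "Id" :: "MySuperLongNameProperty" :: [ for i in 2..9 -> sprintf "MySuperLongName%dProperty" i ]

let shortNames =
    "Id" :: "mp" :: [ for i in 2..9 -> sprintf "m%dp" i ]

// Hypothetical helper: total raw JSON size in GB (10^9 bytes) for a number of records.
let totalGb (records: int) (names: string list) =
    float records * float (recordBytes names) / 1e9

printfn "long:  %d bytes/record, %.2f GB for 3M records"
    (recordBytes longNames) (totalGb 3000000 longNames)
printfn "short: %d bytes/record, %.2f GB for 3M records"
    (recordBytes shortNames) (totalGb 3000000 shortNames)
```

Under these assumptions a long-name record is 638 bytes and a short-name record is 449 bytes, so 3M records come to about 1.9 GB and 1.35 GB of raw JSON. That is in the same range as the raw figures quoted above, which suggests the raw-size gap between the two schemas comes from the field names alone.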

Originally, I expected to see the behavior described here. But right after the 2nd experiment the size difference was significant (the index had not been compressed yet).

My guess is that there are a few strategies for storing the index data. Maybe for small indexes field names are compressed right away, while for larger ones the data is stored as is and a background job is scheduled to compress it later.

As a result, as far as I can see, field naming makes no difference: the length of a field name does not affect the final storage size.

Any thoughts or explanations?
