Reputation: 5726
I have a lot of data stored in Azure Search, and since I'd like to predict the index size and service costs, I decided to understand how the index data is stored.
Spoiler: according to the experiment, field name length does not impact the storage used for the index.
Input (examples at the end)
Data structure with Id
+ 9 string fields. All fields have long names. Field name length vs. data length is 24 vs. 37 characters.
Record example:
{
"Id": "55bd7474-1e48-464c-a54d-bc2f3d8b0383",
"MySuperLongNameProperty": "0e2c5f5e-9464-4030-bf3f-9de41181faff",
"MySuperLongName2Property": "aa521300-1925-4dd6-97f2-f27fed1b720e",
"MySuperLongName3Property": "9eec9f1f-d970-4581-8677-92cd735c9d80",
"MySuperLongName4Property": "e3b4619b-bb8c-4fa2-82b2-55287f4262ae",
"MySuperLongName5Property": "e6b79880-650d-4733-b91a-e5a4e066811d",
"MySuperLongName6Property": "d391c66c-f3c6-45e2-96ef-80ab682fa07b",
"MySuperLongName7Property": "62a92d68-74e6-41b1-8f92-ac3795b649cd",
"MySuperLongName8Property": "83510497-a6b0-4d6e-9130-0f8deefd73db",
"MySuperLongName9Property": "977e397e-5fc9-4677-afaf-52b9ea0a8f23"
}
Data structure with Id
+ 9 string fields. All fields have short names. Field name length vs. data length is 3 vs. 37 characters.
Record example:
{
"Id": "f403f9ce-b343-4e38-bc4b-24d300eb13fb",
"mp": "10970b17-62fe-431a-bf4f-d5a17266c4dc",
"m2p": "b338290b-069b-4494-8c9e-8da85aad0990",
"m3p": "1be76d7f-07d2-4648-9888-ed15ec7b3857",
"m4p": "327206c8-561c-4651-95e0-06c58f83739a",
"m5p": "241b2be7-9aac-41f9-b669-c5c768acd42e",
"m6p": "55a1691a-d525-442e-b369-380d2480f2b1",
"m7p": "a1263c81-022b-4f59-97fe-8916e1457d35",
"m8p": "b4a4819b-185b-46ab-8e34-838fbc8a598a",
"m9p": "38bc1df8-81cf-4005-bb14-2fe8a1c6797a"
}
Experiments
For each experiment I populated all fields with GUID data (.NET Guid.NewGuid().ToString()).
Experiments are executed as N batches of 1000 items each:
let insert<'t> (client: ISearchIndexClient) (docs: 't list) =
    // Wrap each document in an Upload action and send them as a single batch
    let actions = docs |> List.map (fun x -> IndexAction.Upload x)
    let batch = IndexBatch.New(actions)
    client.Documents.Index batch |> ignore

for x in [1..1000] do
    let batch = [1..1000] |> List.map (fun i -> {.. generate record ..})
    insert client batch
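The storage numbers below can be read from the portal, or programmatically. A minimal sketch, assuming the same Microsoft.Azure.Search SDK that provides ISearchIndexClient above; the service name, key, and index names are placeholders:

```fsharp
open Microsoft.Azure.Search

// Service-level client (as opposed to the per-index ISearchIndexClient)
let serviceClient =
    new SearchServiceClient("my-search-service", SearchCredentials("admin-api-key"))

let printIndexSize (indexName: string) =
    // GetStatistics reports the document count and the storage size in bytes
    let stats = serviceClient.Indexes.GetStatistics(indexName)
    printfn "%s: %d docs, %.2f GB"
        indexName
        stats.DocumentCount
        (float stats.StorageSize / 1024. / 1024. / 1024.)

printIndexSize "long-names-index"
printIndexSize "short-names-index"
```

Note that these statistics are collected periodically, which is consistent with the delayed size drop observed below.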
So, some numbers:
Adding 1.2M records to the index:
Long name storage size: 1.68 GB
Short name storage size: 1.65 GB
Adding 3M records to the index:
Long name storage size: 5.53 GB (~2 GB raw JSON text data)
Short name storage size: 4.11 GB (~1.5 GB raw JSON text data)
After 10-20 minutes the overall size was suddenly reduced automatically:
Long name storage size: 4.04 GB
Short name storage size: 4.06 GB
Originally, I expected to see the behavior described here, but after the 2nd experiment the size difference was significant (the index had not been compacted yet).
My assumption is that there are a few strategies for storing index data: maybe for small indexes field names are compressed right away, while for larger ones the data is stored as-is and a background service is scheduled for later compaction.
In the end, as far as I can see, field naming makes no practical difference: the length of a field name does not noticeably impact the storage size.
Any thoughts or explanations?
Upvotes: 1
Views: 440
Reputation: 980
Indeed, the names you give to your fields should generally have a negligible impact on the overall size of your index. Each document field exists on disk in multiple different forms, depending on which features are enabled for that field (searchable, filterable, sortable, etc.). Most of those forms are heavily optimized to serve their specific needs, and in most cases the field names don't need to be included in the files that contain them.
However, the original JSON documents are also stored alongside the indexed versions (so that documents can be retrieved). Since the original documents include the field names, there will technically be some linear correlation between the length of the field names and the overall size of your index, but that correlation should have a pretty weak coefficient. The best way to determine the coefficient for your data is through tests (which you have already done), since every use case is different.
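A quick back-of-envelope check of that weak correlation, using the approximate name lengths from the question (24 vs. 3 characters, 9 renamed fields, 3M documents); the numbers are rough since they ignore quoting and separators:

```fsharp
// Estimate the raw-JSON overhead contributed by the longer field names
let longNameLen, shortNameLen = 24, 3
let fieldsPerDoc = 9
let docs = 3_000_000L

let extraBytesPerDoc = int64 (fieldsPerDoc * (longNameLen - shortNameLen))
let totalExtraGb = float (docs * extraBytesPerDoc) / 1024. / 1024. / 1024.
printfn "Extra raw JSON from long names: ~%.2f GB" totalExtraGb
// ~0.53 GB
```

That is close to the ~0.5 GB gap between the raw JSON sizes reported in the question (~2 GB vs. ~1.5 GB), which supports the idea that the names only affect the stored copy of the original documents, not the optimized index structures.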
Upvotes: 2