rachel song

Reputation: 54

How to insert 100-200 GB of data into a collection faster? (pymilvus 2.4.3)

I am currently using pymilvus 2.4.3 and my data contains sparse vectors.

I am currently using client.insert(), but it has a 64 MB RPC limit. I split my ~115 GB data table into 1,750 files with PySpark, wrote them to a location on Databricks, and upload them file by file. However, this takes about 1 minute per file, so 1,750 files will take a whopping ~29 hours!
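For reference, my per-file upload loop looks roughly like the sketch below (the paths, collection name, and batch size are placeholders, not my exact code):

    # Rough sketch of the current per-file insert loop (placeholders throughout).
    import pandas as pd
    from pymilvus import MilvusClient

    client = MilvusClient(uri="http://localhost:19530")

    files = [f"/dbfs/tmp/chunks/part-{i:05d}.parquet" for i in range(1750)]
    for path in files:
        df = pd.read_parquet(path)
        rows = df.to_dict("records")  # each row carries a dict-encoded sparse vector, e.g. {token_id: weight}
        # insert in slices small enough to stay under the 64 MB gRPC limit
        batch = 2000
        for start in range(0, len(rows), batch):
            client.insert(collection_name="my_collection", data=rows[start:start + batch])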

How do I insert my data into a collection faster? I know there is a spark-milvus connector, but it currently does not support sparse vectors.

I also saw there was do_bulk_insert, but I kept getting an error that says:

    - taskID          : 450235310995975498,
    - state           : Failed,
    - row_count       : 0,
    - infos           : {'failed_reason': 'typeutil.GetDim should not invoke on sparse vector type', 'progress_percent': '0'},
    - id_ranges       : [],
    - create_ts       : 2024-06-04 15:44:37
>

I think this may be a bug; I'm not sure why the bulk insert process is looking for a dimension value for the sparse vector type.
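My attempt was roughly along these lines (the file path and collection name are placeholders for my actual setup):

    # Sketch of the bulk-insert attempt that produced the failed state above.
    from pymilvus import connections, utility

    connections.connect(uri="http://localhost:19530")

    task_id = utility.do_bulk_insert(
        collection_name="my_collection",
        files=["sparse_chunks/part-00000.parquet"],  # object-storage path visible to Milvus
    )
    state = utility.get_bulk_insert_state(task_id=task_id)
    print(state)  # prints the task state, including failed_reason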

Thanks!

Upvotes: 0

Views: 135

Answers (1)

Rashad Tockey

Reputation: 193

Bulk insertion can probably help you insert data faster. It looks like the most up-to-date version of Milvus supports bulk-inserting sparse embeddings: https://milvus.io/docs/import-data.md
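A rough sketch of that workflow, assuming a Milvus/pymilvus release recent enough to accept SPARSE_FLOAT_VECTOR in the bulk-import path (field names, paths, and the my_sparse_rows variable below are placeholders):

    # Sketch only: prepare import files with BulkWriter, then trigger bulk insert.
    # Assumes sparse-vector support in bulk import on your Milvus version.
    from pymilvus import CollectionSchema, FieldSchema, DataType, connections, utility
    from pymilvus.bulk_writer import LocalBulkWriter, BulkFileType

    schema = CollectionSchema(fields=[
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        FieldSchema(name="sparse", dtype=DataType.SPARSE_FLOAT_VECTOR),
    ])

    writer = LocalBulkWriter(
        schema=schema,
        local_path="./bulk_out",          # generated files; copy them to the object storage Milvus can read
        file_type=BulkFileType.PARQUET,
    )
    for i, sparse_vec in enumerate(my_sparse_rows):   # my_sparse_rows: dicts like {token_id: weight}
        writer.append_row({"id": i, "sparse": sparse_vec})
    writer.commit()

    connections.connect(uri="http://localhost:19530")
    for batch in writer.batch_files:      # each batch is a list of generated file paths
        task_id = utility.do_bulk_insert(collection_name="my_collection", files=batch)
        print(utility.get_bulk_insert_state(task_id=task_id))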

Bulk insertion speed depends on how many index nodes and data nodes you have; for your data size it should take around 1-2 hours. Batch insertion is usually at least 5-10 times faster than streaming insert. Based on personal experience, Zilliz Cloud can accelerate this kind of task by more than 10x, since we have a large pool of nodes for bulk insertion and index building.

You can also configure the node replica counts and their CPU requests/limits if you deploy with Helm or the Kubernetes operator.

Upvotes: 1
