Reputation: 54
I am currently using pymilvus 2.4.3, and my data contains sparse vectors.
I am using client.insert(), but it has a 64 MB RPC limit, so I split my ~115 GB data table into 1750 files using PySpark, wrote them to a location on Databricks, and upload them file by file. However, this takes about 1 minute per file, which means all 1750 files will take a whopping ~29 hours!
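In case it helps, here is a simplified sketch of my current loop (the URI, file paths, and field names below are placeholders for my real setup):

```python
import pandas as pd
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # placeholder URI

for i in range(1750):
    # each pre-split file is small enough to stay under the 64 MB gRPC limit
    df = pd.read_parquet(f"/dbfs/exports/part-{i:04d}.parquet")  # placeholder path
    rows = [
        {
            "id": int(row["id"]),
            # pymilvus accepts sparse vectors as {dim_index: value} dicts;
            # my column already stores them in that shape
            "sparse_vector": {int(k): float(v) for k, v in row["sparse_vector"].items()},
        }
        for _, row in df.iterrows()
    ]
    client.insert(collection_name="my_collection", data=rows)
```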
How do I insert my data into the collection faster? I know there is the spark-milvus connector, but it currently does not support sparse vectors.
I also saw there was do_bulk_insert, but I kept getting an error that says:
- taskID : 450235310995975498,
- state : Failed,
- row_count : 0,
- infos : {'failed_reason': 'typeutil.GetDim should not invoke on sparse vector type', 'progress_percent': '0'},
- id_ranges : [],
- create_ts : 2024-06-04 15:44:37
I think this may be a bug; I'm not sure why the bulk insert process is looking for a dimension value for a sparse vector type.
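For reference, this is roughly how I invoke it (the URI, collection name, and file path are placeholders):

```python
from pymilvus import connections, utility

connections.connect(uri="http://localhost:19530")  # placeholder URI

task_id = utility.do_bulk_insert(
    collection_name="my_collection",
    files=["exports/part-0000.parquet"],  # placeholder path
)
# this is where the Failed state above shows up
print(utility.get_bulk_insert_state(task_id))
```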
Thanks!
Upvotes: 0
Views: 135
Reputation: 193
Bulk insertion can probably help you insert data faster. The most up-to-date version of Milvus supports bulk-inserting sparse embeddings: https://milvus.io/docs/import-data.md
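Here is a minimal sketch of that workflow using pymilvus's bulk writer (the URI and the collection/field names are placeholders, and sparse-vector support in the writer assumes a recent 2.4.x release):

```python
from pymilvus import CollectionSchema, FieldSchema, DataType, connections, utility
from pymilvus.bulk_writer import BulkFileType, LocalBulkWriter

# schema must match the target collection; names here are placeholders
schema = CollectionSchema(fields=[
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="sparse_vector", dtype=DataType.SPARSE_FLOAT_VECTOR),
])

# buffers rows locally and flushes them into import-ready Parquet files
writer = LocalBulkWriter(
    schema=schema,
    local_path="./bulk_out",
    file_type=BulkFileType.PARQUET,
)
writer.append_row({"id": 0, "sparse_vector": {7: 0.42, 1024: 0.13}})
writer.commit()

# hand the generated files to the server-side import job
connections.connect(uri="http://localhost:19530")  # placeholder URI
task_id = utility.do_bulk_insert(
    collection_name="my_collection",
    files=writer.batch_files[0],
)
print(utility.get_bulk_insert_state(task_id))
```

Note that do_bulk_insert reads the files from the object storage Milvus is configured against, so in a real deployment you would typically use RemoteBulkWriter instead, which uploads the generated files to that bucket for you.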
Bulk insertion speed depends on how many index nodes and data nodes you have; your dataset should take around 1-2 hours. Batch insertion is usually at least 5-10 times faster than streaming inserts. Based on personal experience, using Zilliz Cloud can accelerate your task more than 10x, since we have a large pool for bulk insertion and index building.
You can also configure node replica counts and their CPU requests/limits if you deploy with Helm or the Kubernetes operator.
Upvotes: 1