I am building an application that uses LLM inference, so I tried vLLM. Since there is very little information available for this library, I would like to seek the help of experts.
The code snippet attached below is just for testing whether vLLM works on my system.
Here is the code:
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)
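(For context: the script is then supposed to print the generated text. A minimal print loop like the one in the vLLM quickstart would look like the following; it is not part of the snippet above, and the attribute names `prompt` and `outputs[0].text` come from the RequestOutput objects that `generate` returns.)
# Print each prompt together with its generated continuation
# (sketch based on the vLLM quickstart example).
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")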
Traceback:
WARNING 09-28 10:28:51 _custom_ops.py:18] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
WARNING 09-28 10:28:57 config.py:1656] Casting torch.float16 to torch.bfloat16.
WARNING 09-28 10:28:57 config.py:376] Async output processing is only supported for CUDA or TPU. Disabling it for other platforms.
WARNING 09-28 10:28:57 config.py:681] Possibly too large swap space. 4.00 GiB out of the 7.09 GiB total CPU memory is allocated for the swap space.
INFO 09-28 10:28:57 llm_engine.py:226] Initializing an LLM engine (v0.6.1.dev238+ge2c6e0a82) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, num_scheduler_steps=1, multi_step_stream_outputs=False, enable_prefix_caching=False, use_async_output_proc=False, use_cached_outputs=False, mm_processor_kwargs=None)
/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING 09-28 10:28:59 cpu_executor.py:328] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 09-28 10:28:59 cpu_executor.py:354] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 09-28 10:28:59 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-28 10:28:59 selector.py:116] Using XFormers backend.
/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 09-28 10:28:59 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 09-28 10:28:59 selector.py:116] Using XFormers backend.
INFO 09-28 10:29:00 weight_utils.py:242] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/weight_utils.py:424: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 1.68it/s]
INFO 09-28 10:29:01 cpu_executor.py:212] # CPU blocks: 7281
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][rank0]: Traceback (most recent call last):
[rank0]: File "/home/dharsann/Documents/llm/test.py", line 14, in <module>
[rank0]: outputs = llm.generate(prompts, sampling_params)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/utils.py", line 1047, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 388, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 877, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 1264, in step
[rank0]: outputs = self.model_executor.execute_model(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 227, in execute_model
[rank0]: output = self.driver_method_invoker(self.driver_worker,
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/executor/cpu_executor.py", line 377, in _driver_method_invoker
[rank0]: return getattr(driver, method)(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 303, in execute_model
[rank0]: inputs = self.prepare_input(execute_model_req)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 291, in prepare_input
[rank0]: return self._get_driver_input_and_broadcast(execute_model_req)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 253, in _get_driver_input_and_broadcast
[rank0]: self.model_runner.prepare_model_input(
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 494, in prepare_model_input
[rank0]: model_input = self._prepare_model_input_tensors(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 482, in _prepare_model_input_tensors
[rank0]: return builder.build() # type: ignore
[rank0]: ^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 130, in build
[rank0]: multi_modal_kwargs) = self._prepare_prompt(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/worker/cpu_model_runner.py", line 265, in _prepare_prompt
[rank0]: attn_metadata = self.attn_backend.make_metadata(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/dharsann/Documents/llm/.venv/lib/python3.12/site-packages/vllm/attention/backends/abstract.py", line 47, in make_metadata
[rank0]: return cls.get_metadata_cls()(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: TypeError: XFormersMetadata.__init__() got an unexpected keyword argument 'is_prompt'
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]