I'm trying to serve the model unsloth/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit with SGLang, but I'm getting an error that I don't understand at all.
This is my command:
sudo docker run --gpus all \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=my_token" \
--ipc=host \
lmsysorg/sglang:latest \
bash -c "pip3 install bitsandbytes && python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit --host 0.0.0.0 --port 8000 --load-format safetensors --quantization bitsandbytes"
I also tried replacing the Hugging Face model ID with a local folder path mounted into the container, roughly like this (the host folder name below is just an example of what I used):
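sudo docker run --gpus all \
-p 8000:8000 \
-v ~/models/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit:/models/nemotron-70b-bnb-4bit \
--env "HF_TOKEN=my_token" \
--ipc=host \
lmsysorg/sglang:latest \
bash -c "pip3 install bitsandbytes && python3 -m sglang.launch_server --model-path /models/nemotron-70b-bnb-4bit --host 0.0.0.0 --port 8000 --load-format safetensors --quantization bitsandbytes"

Either way, I still hit the same error: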
Successfully installed bitsandbytes-0.44.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[2024-11-05 17:48:08] server_args=ServerArgs(model_path='/root/.cache/huggingface/hub/models--unsloth--Llama-3.1-Nemotron-70B-Instruct-bnb-4bit/snapshots/8657c9f1ecf5a7aa33d862a1fdcd9d67887d87e4', tokenizer_path='/root/.cache/huggingface/hub/models--unsloth--Llama-3.1-Nemotron-70B-Instruct-bnb-4bit/snapshots/8657c9f1ecf5a7aa33d862a1fdcd9d67887d87e4', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='safetensors', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization='bitsandbytes', context_length=None, device='cuda', served_model_name='/root/.cache/huggingface/hub/models--unsloth--Llama-3.1-Nemotron-70B-Instruct-bnb-4bit/snapshots/8657c9f1ecf5a7aa33d862a1fdcd9d67887d87e4', chat_template=None, is_embedding=False, host='0.0.0.0', port=8000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=842229254, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-05 17:48:15 TP0] Init torch distributed begin.
[2024-11-05 17:48:16 TP0] Load weight begin. avail mem=78.84 GB
[2024-11-05 17:48:16 TP0] lm_eval is not installed, GPTQ may not be usable
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
[2024-11-05 17:48:16 TP0] Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1191, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 163, in __init__
    self.tp_worker = TpWorkerClass(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 149, in __init__
    self.load_model()
  File "/sgl-workspace/sglang/python/sglang/srt/model_executor/model_runner.py", line 253, in load_model
    self.model = get_model(
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 402, in load_model
    model.load_weights(self._get_all_weights(model_config, model))
  File "/sgl-workspace/sglang/python/sglang/srt/models/llama.py", line 411, in load_weights
    param = params_dict[name]
KeyError: 'model.layers.40.mlp.down_proj.weight'
W1105 17:48:16.881000 139936802006784 torch/_inductor/compile_worker/subproc_pool.py:126] SubprocPool unclean exit
The code is running on a VM with an A100 GPU.