How to serve a bitsandbytes model with SGLang

Question

Im trying to serve the model: unsloth/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit. But I'm having an error and I don't really understand it at all.

This is my command code:

sudo docker run --gpus all \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=my_token" \
    --ipc=host \
    lmsysorg/sglang:latest \
    bash -c "pip3 install bitsandbytes" \ 
    bash -c "python3 -m sglang.launch_server --model-path unsloth/Llama-3.1-Nemotron-70B-Instruct-bnb-4bit --host 0.0.0.0 --port 8000 --load-format safetensors --quantization bitsandbytes"

I tried to replace the model path with a folder path but still having the same error:

Successfully installed bitsandbytes-0.44.1
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[2024-11-05 17:48:08] server_args=ServerArgs(model_path='/root/.cache/huggingface/hub/models--unsloth--Llama-3.1-Nemotron-70B-Instruct-bnb-4bit/snapshots/8657c9f1ecf5a7aa33d862a1fdcd9d67887d87e4', tokenizer_path='/root/.cache/huggingface/hub/models--unsloth--Llama-3.1-Nemotron-70B-Instruct-bnb-4bit/snapshots/8657c9f1ecf5a7aa33d862a1fdcd9d67887d87e4', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='safetensors', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization='bitsandbytes', context_length=None, device='cuda', served_model_name='/root/.cache/huggingface/hub/models--unsloth--Llama-3.1-Nemotron-70B-Instruct-bnb-4bit/snapshots/8657c9f1ecf5a7aa33d862a1fdcd9d67887d87e4', chat_template=None, is_embedding=False, host='0.0.0.0', port=8000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=842229254, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-05 17:48:15 TP0] Init torch distributed begin.
[2024-11-05 17:48:16 TP0] Load weight begin. avail mem=78.84 GB
[2024-11-05 17:48:16 TP0] lm_eval is not installed, GPTQ may not be usable
Loading safetensors checkpoint shards:   0% Completed | 0/8 [00:00


The code is runing on a VM with an A100 GPU

How to serve a bitsandbytes model with SGLang

Answers (0)

Related Questions