An LLM was fine-tuned to generate news headlines.
Inference is run with either the vLLM or the TensorRT-LLM framework. Whereas vLLM produces highly diverse records (headlines) both within a batch and across batches, TensorRT-LLM repeats the same record over and over.
Here are the two scripts. The vLLM version:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model with the LoRA headline adapter enabled.
model = LLM(model=base_model, tokenizer=adapter_dir, enable_lora=True, max_lora_rank=16)
lora_request = LoRARequest('headlines', 1, adapter_dir)

# One BOS-only prompt per requested sample.
inputs = ['<s> '] * sample_size
sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_num_tokens)
output = model.generate(prompts=inputs, sampling_params=sampling_params, lora_request=lora_request, use_tqdm=False)
output_texts = [o.outputs[0].text for o in output]
Output examples:
{"headline":"Papa John Wants To Buy His Old Company Back, Courts All Over The World To Their Screen"}
{"headline":"Quinn Quashes Donald Sterling Reboot, Then He Shows Up"}
{"headline":"17 Crucial Lessons I Learned From ‘Avengers: Infinity War’"}
{"headline":"Why Do The Eagles Want The Super Bowl To Go To Overtime?"}
{"headline":"A Single Person Shouldn't Have To Endure Melania Trump's Tragic Teddy"}
{"headline":"‘I Feel Really Happy For My Mother’: Prince William Reveals How He Reacted To Pippa Middleton's Wedding"}
{"headline":"NRA Rips Corps Of Democratic Donors, Including New York Lawyers And TV Executive"}
{"headline":"Tesco To Scrap Plastic Bags For Customers, And While That's Great It's Not Enough"}
{"headline":"Dominican Pleads Guilty To Drug Conspiracy That Led To Driver's Death"}
And the TensorRT-LLM version:
# Note: exact import paths may differ slightly between TensorRT-LLM versions.
from tensorrt_llm import LLM, SamplingParams, BuildConfig
from tensorrt_llm.llmapi import KvCacheConfig
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.plugin import PluginConfig
from tensorrt_llm.executor import LoRARequest

# Build the engine in bfloat16 with the LoRA plugin and the headline adapter.
plugin_config = PluginConfig()
plugin_config.gemm_plugin = 'bfloat16'
plugin_config.lora_plugin = 'bfloat16'
lora_config = LoraConfig(lora_dir=[adapter_dir])
kvcache_config = KvCacheConfig(free_gpu_memory_fraction=0.95)
build_config = BuildConfig(max_batch_size=max_batch_size, lora_config=lora_config, plugin_config=plugin_config, use_fused_mlp=True)
model = LLM(model=base_model, tokenizer=adapter_dir, enable_lora=True, max_lora_rank=16, build_config=build_config, kv_cache_config=kvcache_config)
lora_request = LoRARequest('headlines', 1, adapter_dir)

# Same BOS-only prompts and sampling settings as in the vLLM script.
inputs = ['<s> '] * sample_size
sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_num_tokens, add_special_tokens=False, end_id=2, pad_id=0)
output = model.generate(inputs=inputs, sampling_params=sampling_params, lora_request=lora_request, use_tqdm=False)
output_texts = [o.outputs[0].text for o in output]
Output examples:
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
How should the generation part of the TensorRT-LLM script be modified so that it produces highly diverse (random) records, like vLLM does? It looks as if TensorRT-LLM is not doing random sampling at all, unlike vLLM or, for example, the Hugging Face transformers library with do_sample=True.
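For reference, this is the kind of per-sequence random sampling the question is asking about; a minimal, self-contained sketch with the Hugging Face transformers library (the model name and sampling values here are placeholders, not the fine-tuned headline model):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the actual fine-tuned headline model/adapter is not shown here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("<s> ", return_tensors="pt")
# do_sample=True enables multinomial sampling, so each returned sequence
# is drawn independently and differs from the others.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=32,
    num_return_sequences=4,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

With do_sample=True the four returned sequences differ from each other, which matches the diverse behavior seen with vLLM above and is exactly what is missing from the TensorRT-LLM run.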