An LLM was fine-tuned to generate news headlines.
Inference is run with either the vLLM or the TensorRT-LLM framework. Whereas vLLM produces highly diverse records (headlines) both within a batch and across batches, TensorRT-LLM repeats the same record over and over.
Here are the two scripts. The vLLM version:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model with the LoRA headline adapter enabled.
model = LLM(model=base_model, tokenizer=adapter_dir, enable_lora=True, max_lora_rank=16)
lora_request = LoRARequest('headlines', 1, adapter_dir)

# One BOS-only prompt per requested sample.
inputs = ['<s> '] * sample_size
sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_num_tokens)
output = model.generate(prompts=inputs, sampling_params=sampling_params, lora_request=lora_request, use_tqdm=False)
output_texts = [o.outputs[0].text for o in output]
Output examples:
{"headline":"Papa John Wants To Buy His Old Company Back, Courts All Over The World To Their Screen"}
{"headline":"Quinn Quashes Donald Sterling Reboot, Then He Shows Up"}
{"headline":"17 Crucial Lessons I Learned From ‘Avengers: Infinity War’"}
{"headline":"Why Do The Eagles Want The Super Bowl To Go To Overtime?"}
{"headline":"A Single Person Shouldn't Have To Endure Melania Trump's Tragic Teddy"}
{"headline":"‘I Feel Really Happy For My Mother’: Prince William Reveals How He Reacted To Pippa Middleton's Wedding"}
{"headline":"NRA Rips Corps Of Democratic Donors, Including New York Lawyers And TV Executive"}
{"headline":"Tesco To Scrap Plastic Bags For Customers, And While That's Great It's Not Enough"}
{"headline":"Dominican Pleads Guilty To Drug Conspiracy That Led To Driver's Death"}
And the TensorRT-LLM version:
# Note: exact import paths may differ slightly between TensorRT-LLM versions.
from tensorrt_llm import LLM, SamplingParams, BuildConfig
from tensorrt_llm.llmapi import KvCacheConfig
from tensorrt_llm.lora_manager import LoraConfig
from tensorrt_llm.plugin import PluginConfig
from tensorrt_llm.executor import LoRARequest

# Build the engine in bfloat16 with the LoRA plugin and the headline adapter.
plugin_config = PluginConfig()
plugin_config.gemm_plugin = 'bfloat16'
plugin_config.lora_plugin = 'bfloat16'
lora_config = LoraConfig(lora_dir=[adapter_dir])
kvcache_config = KvCacheConfig(free_gpu_memory_fraction=0.95)
build_config = BuildConfig(max_batch_size=max_batch_size, lora_config=lora_config, plugin_config=plugin_config, use_fused_mlp=True)
model = LLM(model=base_model, tokenizer=adapter_dir, enable_lora=True, max_lora_rank=16, build_config=build_config, kv_cache_config=kvcache_config)
lora_request = LoRARequest('headlines', 1, adapter_dir)

# Same BOS-only prompts and sampling settings as in the vLLM script.
inputs = ['<s> '] * sample_size
sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=max_num_tokens, add_special_tokens=False, end_id=2, pad_id=0)
output = model.generate(inputs=inputs, sampling_params=sampling_params, lora_request=lora_request, use_tqdm=False)
output_texts = [o.outputs[0].text for o in output]
Output examples:
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
{"headline":"'House Of Cards' Cast Unveils First Look At New Faces — And Not A Kevin Spacey In Sight"}
How should the generation part of the TensorRT-LLM script be modified so that it produces highly diverse (random) records, like vLLM does? It looks as if TensorRT-LLM is not doing random sampling at all, unlike vLLM or, for example, the Hugging Face transformers library with do_sample=True.
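For reference, this is the kind of per-sequence random sampling the question is asking about; a minimal, self-contained sketch with the Hugging Face transformers library (the model name and sampling values here are placeholders, not the fine-tuned headline model):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; the actual fine-tuned headline model/adapter is not shown here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("<s> ", return_tensors="pt")
# do_sample=True enables multinomial sampling, so each returned sequence
# is drawn independently and differs from the others.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.9,
    top_p=0.95,
    max_new_tokens=32,
    num_return_sequences=4,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

With do_sample=True the four returned sequences differ from each other, which matches the diverse behavior seen with vLLM above and is exactly what is missing from the TensorRT-LLM run.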