vLLM Parameter Tuning for Better Performance

 vLLM Parameters

  

We all know that vLLM is a fast and easy-to-use library for LLM inference and serving. In this post we shall go through tuning some of its parameters to get better performance out of vLLM.
 
The vLLM engine parameters we shall discuss are:
- max-num-batched-tokens
- max-model-len
- gpu-memory-utilization
- enable-prefix-caching
- enable-chunked-prefill
- enforce-eager

max-model-len:

TL;DR: set it according to your maximum token usage (input + output).

- By default, the max model length is the maximum context length of the model you are using, e.g. 8192 for the Llama 3 8B Instruct model.
- If you can determine the maximum context length your use case needs and it is less than the model's limit, it is better to set this parameter to that smaller value.
- Besides preventing out-of-memory errors while loading or using the model, it also makes it easier to set other parameters like max-num-batched-tokens and gpu-memory-utilization.
- The value covers both input and output tokens, so if your use case never exceeds a total of, say, 2048 tokens, set this parameter to that value. A small sizing sketch follows below.
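
Below is a rough sizing sketch using vLLM's offline LLM class (the server takes the same value via --max-model-len). The model name is the GPTQ repo used in the example commands later in this post; longest_prompt and the 512-token output budget are placeholders for your own workload.

# Size max_model_len from your own prompts rather than the model's limit.
from transformers import AutoTokenizer
from vllm import LLM

MODEL = "MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
longest_prompt = "... your longest expected prompt ..."
input_tokens = len(tokenizer(longest_prompt).input_ids)  # tokens in the biggest input you expect
output_budget = 512                                      # the max_tokens you plan to request

# input + output must fit inside max_model_len
llm = LLM(model=MODEL, dtype="half", max_model_len=input_tokens + output_budget)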
 

 max-num-batched-tokens:

TL;DR: start with a value equal to max-model-len, then try max-model-len*2, *3, ... until you hit an out-of-memory error or a preemption warning, and pick the value that gives the best performance. If using enable-chunked-prefill, start from a smaller value like 128 and try different values. Refer below for the details.
 
- Note that this parameter is a number of tokens, not a batch size (number of requests).
- Every request goes through a prefill stage (computing the KV values of the input tokens and assigning them to KV cache blocks) and then a decode stage, where autoregressive forward passes generate the output tokens one by one.
- So the minimum value for this parameter equals the max-model-len you have set, so that a full prefill fits in one batch (unless you use chunked prefill, more on that later).
- A higher batched-token value allows more prefill work per step, so the first token is generated more quickly. The vLLM scheduler by default prefers prefill.
- A large batched-token value can therefore give higher throughput, but this does not work in all cases: throughput and latency also depend on the GPU's compute/memory trade-offs (more on this in the chunked-prefill section), and since prefill is compute heavy it quickly saturates the GPU, which limits how far raising max-num-batched-tokens keeps helping.
- So it is better to start from the max-model-len value, keep doubling it, and check how performance changes until you hit an out-of-memory error or a preemption warning, then pick the value that works best. After that, try to apply the theory above to understand how your setup behaves. A rough sweep sketch follows below.
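
Here is a rough sketch of such a sweep using the offline LLM class, assuming a max-model-len of 4096 and an illustrative prompt set; the same values go to the server via --max-num-batched-tokens. In practice it is safer to run each configuration in a fresh process (or use vLLM's benchmark scripts), since GPU memory is not always fully released between LLM instances.

import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the benefits of paged attention."] * 64   # placeholder workload
params = SamplingParams(temperature=0.0, max_tokens=128)

for batched_tokens in (4096, 8192, 16384):   # max-model-len, then double, then double again
    llm = LLM(model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
              dtype="half", max_model_len=4096,
              max_num_batched_tokens=batched_tokens)
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    print(f"max_num_batched_tokens={batched_tokens}: {len(outputs) / elapsed:.2f} req/s")
    del llm   # GPU memory is not always fully freed; a fresh process per value is safer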

enable-prefix-caching:

TL;DR: enable it unless you are using enable-chunked-prefill.

- It caches computed KV values for reuse by later requests, so that computation time is saved.
- If a large portion of your prompt stays the same across requests (for example a long shared system prompt), you will see a bigger improvement than in a setup where only a small portion of the tokens is unchanged.
- If enable-chunked-prefill is used, this parameter is not allowed (as of the date of this post).
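
A minimal sketch of the case where prefix caching pays off: many requests sharing one long system prompt. The model, prompts and lengths below are illustrative only.

from vllm import LLM, SamplingParams

llm = LLM(model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
          dtype="half", max_model_len=4096, enable_prefix_caching=True)

shared_prefix = "You are a support assistant for ExampleCorp. Follow these policies: ..." * 40
params = SamplingParams(temperature=0.0, max_tokens=64)

llm.generate(shared_prefix + "\nUser: How do I reset my password?", params)
# The second request can reuse the cached KV blocks of the shared prefix.
llm.generate(shared_prefix + "\nUser: What is the refund policy?", params)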
 

gpu-memory-utilization:

TL;DR: the higher the better, as long as you don't hit out-of-memory errors. The default is 0.9.

- The more GPU memory allotted to storing the KV cache, the better for performance.
- Reduce this value if you are facing out-of-memory errors, or if the GPU has to share memory with another workload.
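
For example, a small sketch assuming another process needs roughly 30% of the same GPU's memory; the exact fraction depends on your setup.

from vllm import LLM

# The default is 0.9; lower it when another workload shares the GPU.
llm = LLM(model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
          dtype="half", max_model_len=4096,
          gpu_memory_utilization=0.7)   # leave ~30% of VRAM for the other workload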
 

enforce-eager:

TL;DR: if you are short on memory, enable this.

- When enabled, vLLM does not build CUDA graphs, which may cost some performance but saves the memory the graphs would have used.
- For some setups the CUDA graphs do not add much performance, and the memory saved can instead be used to increase parameters like gpu-memory-utilization or max-num-batched-tokens.
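
A small sketch of that trade-off, with illustrative values: skip CUDA graph capture and spend the saved memory on a larger KV cache.

from vllm import LLM

llm = LLM(model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
          dtype="half", max_model_len=4096,
          enforce_eager=True,            # skip CUDA graph capture to save memory
          gpu_memory_utilization=0.95)   # spend the saved memory on a larger KV cache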

enable-chunked-prefill:

TL;DR: try enabling this with max-num-batched-tokens starting from 128 and increasing it, and compare against a setup without this parameter enabled. Refer below for more info.

- The decode stage is less compute intensive than prefill (vector-matrix vs matrix-matrix multiplication) but more memory bound, since it has to read back the previously computed KV values.
- Refer to this paper (https://arxiv.org/pdf/2308.16369) and this vLLM page (https://docs.vllm.ai/en/latest/models/performance.html) if you want to explore more.
- As stated before, by default the vLLM scheduler gives preference to prefill, which is compute bound, while during decode memory bandwidth becomes the bottleneck because the stored KV values must be read back, leaving compute underutilized.
- Chunked prefill splits the prefill stage into chunks, which allows a lower batched-token number. The scheduler then builds a batch by first gathering decode requests and adding a chunked prefill request, so the GPU's resources are used efficiently by mixing a compute-intensive prefill chunk with enough of the low-compute but memory-bound decode requests.
- Again, try enabling it with a small max-num-batched-tokens like 128 and increase it until you reach the throughput and latency numbers you need.
- This method can improve inter-token latency, since decode is prioritized and the small batch reduces the memory bottleneck, but it can hurt throughput, so trying different batched-token values is necessary; also compare the results against a setup without this parameter. A minimal sketch follows below.
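
A minimal sketch with chunked prefill enabled and a small per-step token budget; 128 is just the starting point suggested above, and the model and prompts are illustrative.

from vllm import LLM, SamplingParams

llm = LLM(model="MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ",
          dtype="half", max_model_len=4096,
          enable_chunked_prefill=True,
          max_num_batched_tokens=128)    # start small (128, 256, 512, ...) and tune upward

outputs = llm.generate(["Explain chunked prefill in one paragraph."] * 32,
                       SamplingParams(temperature=0.0, max_tokens=128))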
 

Example:

- We will use the vLLM demo API server (vllm.entrypoints.api_server) instead of the OpenAI-compatible server.
- To run the server without chunked prefill, one example is:

python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 8192 --enable-prefix-caching

- With chunked prefill enabled, the command can be:
 
python3 -m vllm.entrypoints.api_server --model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ --gpu-memory-utilization 0.95 --port 5000 --dtype half --enforce-eager --max-model-len 4096 --max-num-batched-tokens 256 --enable-chunked-prefill

- Send your requests to the above server endpoint to check the performance for the various parameter values; a minimal client sketch follows below.
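
For reference, a minimal client sketch for the demo api_server started above, assuming the /generate route and JSON fields of the demo server at the time of writing (they may differ across vLLM versions).

import requests

resp = requests.post(
    "http://localhost:5000/generate",
    json={
        "prompt": "Write a haiku about fast inference.",
        "max_tokens": 64,
        "temperature": 0.0,
    },
)
print(resp.json())   # the demo server returns the generated text under "text"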

- Benchmark using the scripts provided by vLLM; I will also try to write a similar script and share it here.

Thank you for reading!!
