Running Qwen3-Coder with vLLM and configuring VSCode to use Continue for code completion
Getting Started
The launch of Qwen3-Coder delivers a remarkably capable programming model, with the added bonus of a lightweight variant—Qwen3-Coder-Flash (Qwen3-Coder-30B-A3B-Instruct-FP8)—designed to run on consumer-grade hardware.
Notably, it retains FIM (Fill-in-the-Middle) support like its predecessor Qwen2.5-Coder while adding tool calling capabilities, enabling a single model to function as a chatbot, AI agent, and code completion tool.
Deploying with vLLM
My decision to use vLLM for deploying Qwen3-Coder-30B-A3B-Instruct-FP8—rather than Ollama—was primarily driven by performance. As for why I didn’t opt for SGLang? At the time of writing, SGLang still had unresolved tool calling issues with this model.
Installation steps are omitted here (refer to the official vLLM docs). On my RTX 4090 (48GB VRAM), the model handles a roughly 256K-token context in FP8 at 90% VRAM utilization.
For conservative operation, I configured a 200K context length at 85% VRAM usage. Adjust `--gpu-memory-utilization` and `--max-model-len` for longer contexts, tuning `--max-num-batched-tokens` accordingly.
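Concretely, the launch looks something like the sketch below. The port, served model name, and `--max-num-batched-tokens` budget are illustrative assumptions, and the `qwen3_coder` tool-call parser name depends on your vLLM version:

```bash
# Sketch of a single-GPU vLLM launch for Qwen3-Coder-Flash.
# Mirrors the conservative 200K-context / 85%-VRAM setup described above;
# --max-num-batched-tokens and the tool-call parser name are assumptions.
vllm serve ~/models/Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --served-model-name Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 200000 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --max-num-batched-tokens 8192 \
  --max-seq-len-to-capture 200000 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```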
Key parameters:
- Model path: `~/models/Qwen3-Coder-30B-A3B-Instruct-FP8`
- Native 256K context support, extendable to 1M with YaRN (see the sketch after this list)
- FP8 KV cache quantization (`fp8_e4m3`) reduces the VRAM footprint
- `--max-seq-len-to-capture` matches the context length for optimal CUDA graph performance
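To push past the native window toward 1M tokens, vLLM can apply YaRN RoPE scaling at launch. The sketch below is illustrative only: the JSON key names and scaling factor vary across vLLM and model versions, and a longer window needs proportionally more KV cache VRAM:

```bash
# Sketch: extend the context window with YaRN RoPE scaling.
# Key names ("rope_type" vs. "type") and the right factor depend on your
# vLLM version and the model's native context; treat these values as examples.
vllm serve ~/models/Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --max-model-len 1000000 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":262144}'
```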
Continue Configuration
Update Continue’s config as follows:
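A minimal sketch of that config, using Continue's config.json format (newer Continue releases use config.yaml, but the fields map over) and the local vLLM endpoint from the launch above; the title and apiKey values are placeholders:

```json
{
  "models": [
    {
      "title": "Qwen3-Coder-Flash",
      "provider": "openai",
      "model": "Qwen3-Coder-30B-A3B-Instruct-FP8",
      "apiBase": "http://localhost:8000/v1",
      "apiKey": "none"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen3-Coder-Flash",
    "provider": "openai",
    "model": "Qwen3-Coder-30B-A3B-Instruct-FP8",
    "apiBase": "http://localhost:8000/v1",
    "apiKey": "none"
  }
}
```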
The autocomplete prompt template required special attention: reusing Qwen2.5-Coder's template at first yielded poor results. The correct approach is to wrap the FIM tokens in a chat-completion format:
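Rendered out, the prompt sent to the model looks roughly like the sketch below, which places Qwen's FIM tokens inside an `<|im_start|>`/`<|im_end|>` chat turn. Exact token placement (for example, whether `<|fim_middle|>` ends the user turn) is worth experimenting with:

```
<|im_start|>user
<|fim_prefix|>{code before the cursor}<|fim_suffix|>{code after the cursor}<|fim_middle|><|im_end|>
<|im_start|>assistant
```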
Since Continue sends the rendered template directly to `/v1/completions` (an endpoint suited to base models), we manually adapted the chat template for our instruct-tuned model:
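In config.json terms, that adaptation goes into `tabAutocompleteOptions.template`, which Continue renders with Mustache, substituting `{{{prefix}}}` and `{{{suffix}}}` around the cursor. A sketch consistent with the format above:

```json
{
  "tabAutocompleteOptions": {
    "template": "<|im_start|>user\n<|fim_prefix|>{{{prefix}}}<|fim_suffix|>{{{suffix}}}<|fim_middle|><|im_end|>\n<|im_start|>assistant\n"
  }
}
```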
This implementation demonstrates excellent performance in practice.