Running Qwen3-Coder with vLLM and configuring VSCode to use Continue for code completion

Getting Started

The launch of Qwen3-Coder delivers a remarkably capable programming model, with the added bonus of a lightweight variant—Qwen3-Coder-Flash (Qwen3-Coder-30B-A3B-Instruct-FP8)—designed to run on consumer-grade hardware.

Notably, it retains FIM (Fill-in-the-Middle) support like its predecessor Qwen2.5-Coder while adding tool calling capabilities, enabling a single model to function as a chatbot, AI agent, and code completion tool.

Deploying with vLLM

My decision to use vLLM for deploying Qwen3-Coder-30B-A3B-Instruct-FP8—rather than Ollama—was primarily driven by performance. As for why I didn’t opt for SGLang? At the time of writing, SGLang still had unresolved tool calling issues with this model.

Installation steps are omitted here (refer to the official vLLM docs). On my RTX 4090 (48GB VRAM), this FP8 model can handle the full ~256K-token context at 90% VRAM utilization.

For conservative operation, I configured 200K context length at 85% VRAM usage. Adjust --gpu-memory-utilization and --max-model-len for longer contexts while tuning --max-num-batched-tokens accordingly.

VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve ~/models/Qwen3-Coder-30B-A3B-Instruct-FP8 \
--served-model-name qwen3-coder-flash \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--max-model-len 200000 \
--max-seq-len-to-capture 200000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--model-impl auto \
--gpu-memory-utilization 0.85 \
--kv-cache-dtype fp8_e4m3 \
--dtype auto \
--load-format auto \
--api-key sk-xxxx \
--port 30000 --host 0.0.0.0

Key parameters:

  • Model path: ~/models/Qwen3-Coder-30B-A3B-Instruct-FP8
  • Native 256K context support (extendable to 1M with YaRN)
  • FP8 KV Cache quantization (fp8_e4m3) reduces VRAM footprint
  • --max-seq-len-to-capture matches context length for optimal CUDA graph performance
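
Before wiring anything into the editor, it is worth a quick smoke test to confirm that the server answers and that tool calls are parsed correctly. Below is a minimal sketch using the openai Python package (any OpenAI-compatible client works); the get_weather tool is a made-up example, while the base URL, API key, and served model name come from the serve command above.

# Smoke test: chat completion with tool calling against the local vLLM server.
# Assumes `pip install openai` and that the serve command above is already running.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="sk-xxxx")

# A made-up tool, defined only to exercise --enable-auto-tool-choice.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen3-coder-flash",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# With --tool-call-parser qwen3_coder, the call comes back in structured form.
print(response.choices[0].message.tool_calls)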

Continue Configuration

Update Continue’s config as follows:

name: my-configuration
version: 0.0.1
schema: v1
models:
  - name: Qwen3
    provider: openai
    model: qwen3-coder-flash
    apiBase: http://localhost:30000/v1
    apiKey: sk-xxxx
    defaultCompletionOptions:
      contextLength: 128000
      temperature: 0.6
      maxTokens: 1024
    roles:
      - chat
      - edit
      - autocomplete
      - apply
    capabilities:
      - tool_use
    promptTemplates:
      autocomplete: |
        <|im_start|>system
        You are a code completion assistant.<|im_end|>
        <|im_start|>user
        <|fim_prefix|>{{{prefix}}}<|fim_suffix|>{{{suffix}}}<|fim_middle|><|im_end|>
        <|im_start|>assistant

The autocomplete prompt template required special attention. Reusing Qwen2.5-Coder's raw FIM template at first yielded poor results; because this model is instruction-tuned, the FIM tokens need to be wrapped in a chat completion format:

# Correct message format for /v1/chat/completions
messages = [
    {"role": "system", "content": "You are a code completion assistant."},
    {"role": "user", "content": "<|fim_prefix|>...<|fim_suffix|>...<|fim_middle|>"}
]
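
As a standalone sanity check outside Continue, the same messages can be sent to /v1/chat/completions with any OpenAI-compatible client. A rough sketch, again assuming the openai package and the server settings used above; the prefix/suffix snippet is just an arbitrary example:

# Verify the chat-wrapped FIM format directly via /v1/chat/completions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="sk-xxxx")

prefix = "def fib(n):\n    "                        # code before the cursor
suffix = "\n    return fib(n - 1) + fib(n - 2)\n"   # code after the cursor

response = client.chat.completions.create(
    model="qwen3-coder-flash",
    messages=[
        {"role": "system", "content": "You are a code completion assistant."},
        {"role": "user", "content": f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"},
    ],
    max_tokens=64,
    temperature=0.6,
)

print(response.choices[0].message.content)  # the text to insert at the cursor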

Since Continue sends autocomplete prompts directly to /v1/completions (an endpoint meant for base models), the chat template has to be reproduced by hand inside the prompt template for this instruct-tuned model:

<|im_start|>system
You are a code completion assistant.<|im_end|>
<|im_start|>user
<|fim_prefix|>{{{prefix}}}<|fim_suffix|>{{{suffix}}}<|fim_middle|><|im_end|>
<|im_start|>assistant
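
To see what Continue effectively sends, the rendered template can also be replayed by hand against /v1/completions. A minimal sketch under the same assumptions as above; the stop sequence on <|im_end|> is my own addition so that only the fill itself comes back:

# Replay the rendered autocomplete prompt against /v1/completions,
# mirroring what Continue does with the template above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="sk-xxxx")

prefix = "def fib(n):\n    "
suffix = "\n    return fib(n - 1) + fib(n - 2)\n"

prompt = (
    "<|im_start|>system\n"
    "You are a code completion assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|><|im_end|>\n"
    "<|im_start|>assistant\n"
)

response = client.completions.create(
    model="qwen3-coder-flash",
    prompt=prompt,
    max_tokens=64,
    temperature=0.6,
    stop=["<|im_end|>"],  # assumption: cut generation at the end-of-turn token
)

print(response.choices[0].text)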

This template has performed very well for autocomplete in practice.

