Virtual Models บน LiteLLM Proxy: 1 โมเดล 5 profiles ใช้ให้เหมาะกับงาน

June 16, 2026 · 10 min read

Author

สารบัญ

ผมเคยตั้งค่า vLLM มาหลายเดือนด้วยสูตรเดียวตลอด — temperature: 0.6, top_p: 0.95 แล้วก็ปล่อยให้ทุก client ไป override เอาเอง ไม่ว่าจะเป็น Claude Code, Hermes Agent หรือ OpenWebUI — ใช้ค่าเดียวกันหมด

ผมคิดว่า "โมเดลตัวเดียวกัน ก็ต้องตั้งค่าเหมือนกันสิ" — จนกระทั่งลองเปลี่ยน sampling ตาม use case แล้วเห็นว่า โมเดลตัวเดียวกัน + sampling ต่างกัน = พฤติกรรมต่างกันโดยสิ้นเชิง

TL;DR

โมเดลตัวเดียวกัน + sampling ต่างกัน = ผลลัพธ์ต่างกันโดยสิ้นเชิง
ใช้ LiteLLM alias สร้าง 5 profiles (coder, agent, chat, reasoning, longctx) ชี้ไปที่โมเดลเดียวกัน แต่ตั้งค่า sampling ต่างกัน
Client ไม่ต้องแก้ไขอะไร เปลี่ยน backend ได้ภายหลัง แถม debug ง่ายขึ้นเพราะเห็นชื่อ profile ชัดเจนใน log

Insight: โมเดลเดียว แต่คนละ "คน"

ลองดูตัวอย่างนี้:

Prompt เดียวกัน: "ตั้งชื่อ AI Trading Bot ที่ฟังดูล้ำสมัย"

Temperature	Output
0.3	`TradeBot`, `MarketBot`
1.0	`AlphaPulse`, `QuantumTrader`, `SignalForge`

นี่คือ โมเดลตัวเดียวกัน (Qwen3.6-35B-A3B) — แค่ปรับ temperature ต่างกัน ก็ตอบออกมา "เหมือนกับคนละคนกัน" แล้ว

ผมเริ่มสงสัยว่า — ถ้า Chatbot, Coding Assistant และ Agent ใช้ model ตัวเดิม ทำไมต้องตั้งค่าเหมือนกันล่ะ? แต่ละงานต้องการ "ลักษณะการตอบต่างกัน"

ทางออก: Virtual Models ผ่าน LiteLLM

ถ้า deploy Qwen3.6-35B-A3B ไว้ backend เดียว แทนที่จะให้ทุก client เรียกตรง ๆ ผมสร้าง alias ไว้ใน LiteLLM proxy:

ทุก alias ชี้ไปที่โมเดลตัวเดียวกัน แต่ override sampling parameters ให้เหมาะกับงาน

ข้อดี 3 ข้อ

1. แยกพฤติกรรมตามงาน

Client	ใช้
Claude Code	`qwen-coder` (deterministic)
Hermes Agent	`qwen-agent` (stable)
OpenWebUI	`qwen-chat` (creative)
Repo Analysis	`qwen-longctx` (faithful)

2. เปลี่ยน backend ได้ภายหลัง

วันนี้ backend คือ qwen3.6-35b-a3b วันหน้าอาจเปลี่ยนเป็น qwen4 หรือ deepseek — client ไม่ต้องแก้อะไรเลย

3. Debug ง่ายขึ้น

ดู LiteLLM log เห็นชัดเจนว่า qwen-agent กำลังถูกเรียก แทนที่จะเห็นแค่ qwen3.6-35b-a3b แล้วต้องนั่งเดาว่ามาจาก workflow ไหน

Sampling Parameters 101 (แบบสั้น)

ก่อนไปดู profiles ขอ recap อย่างรวดเร็วว่าแต่ละ parameter ทำอะไร:

Temperature

ควบคุมความสุ่ม — ต่ำ → ตอบแน่นอน / สูง → สำรวจทางเลือกหลากหลาย

Top-P (Nucleus Sampling)

เลือก token จากกลุ่มความน่าจะเป็นรวม — 0.8 = ระมัดระวัง / 0.95 = สำรวจได้กว้าง

Top-K

จำกัดจำนวน token ที่พิจารณา — Qwen แนะนำ top_k=20 สำหรับทุกโหมด

Presence Penalty

ลดโอกาสการใช้คำซ้ำที่เคยปรากฏ — เหมาะสำหรับ reasoning/brainstorming แต่ ไม่เหมาะกับการ coding เพราะโค้ดต้องใช้ identifier ซ้ำ (เช่น user_id, await, async)

Repetition Penalty

Qwen แนะนำ 1.0 — แปลว่าทีม Qwen มองว่าโมเดลจัดการเรื่องนี้ได้ดีอยู่แล้ว

The 5 Profiles

Profile 1: `qwen-reasoning`

เหมาะกับ: วิเคราะห์ระบบ, Architecture Review, Research, Planning, MCP Agent Reasoning

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 20,
  "presence_penalty": 1.5,
  "repetition_penalty": 1.0,
  "max_tokens": 16384
}

เหตุผล: Qwen แนะนำค่านี้โดยตรงสำหรับ thinking mode ทั่วไป — คิดกว้างขวาง สำรวจหลายแนวทาง แต่อาจไม่ deterministic มากนัก

Profile 2: `qwen-coder`

เหมาะกับ: Python, TypeScript, SQL, FastAPI, React, Docker, Kubernetes, MCP Server Development

{
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "presence_penalty": 0.0,
  "repetition_penalty": 1.0,
  "max_tokens": 16384
}

เหตุผล:

Coding ต้องการความถูกต้อง (Correctness) สำคัญกว่าความคิดสร้างสรรค์ (Creativity) — ลด temperature ช่วยให้โมเดลเลือกรูปแบบที่มั่นใจ
presence_penalty = 0 เพราะโค้ดต้องใช้ identifier ซ้ำ ถ้าตั้งสูงโมเดลจะหลีกเลี่ยง token เดิมจนเพี้ยน

Profile 3: `qwen-agent`

เหมาะกับ: MCP, Function Calling, Tool Calling, Multi-Step Agent

{
  "temperature": 0.3,
  "top_p": 0.9,
  "top_k": 20,
  "presence_penalty": 0.0,
  "repetition_penalty": 1.0,
  "max_tokens": 8192
}

เหตุผล: Agent ต่างจาก Chat — ต้องการความถูกต้องแน่นอน (Deterministic) สำคัญกว่าความคิดสร้างสรรค์ (Creative) เพราะ workflow เป็นแบบ Search → Analyze → Call Tool → Summarize ไม่ต้องการให้โมเดลคิด workflow ใหม่ทุกครั้ง

Note: ค่า temperature: 0.3 ไม่ได้อยู่ใน model card ของ Qwen โดยตรง — เป็น hypothesis จากผู้เขียนว่า Agent ควรตั้งค่าต่ำกว่า Coding profile เพื่อความเสถียรสูงสุด

Profile 4: `qwen-chat`

เหมาะกับ: Chatbot, FAQ, Assistant, Support

{
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "presence_penalty": 1.5,
  "repetition_penalty": 1.0,
  "max_tokens": 4096
}

เหตุผล: ใกล้เคียง Instruct Mode ที่ Qwen แนะนำ — ตอบเร็ว ตรงประเด็น ไม่คิดยาวเกินไป

Profile 5: `qwen-longctx`

เหมาะกับ: Codebase Analysis, Repository Review, Log Analysis, Document Analysis, RAG

{
  "temperature": 0.7,
  "top_p": 0.9,
  "top_k": 20,
  "presence_penalty": 0.5,
  "repetition_penalty": 1.0,
  "max_tokens": 16384
}

Note: ค่า temperature: 0.7 และ presence_penalty: 0.5 สำหรับ long context ไม่ได้อยู่ใน model card ของ Qwen โดยตรง — เป็น hypothesis จากประสบการณ์ใช้งานว่า context ยาวต้องการ faithfulness สูง จึงควรตั้งค่าระมัดระวังกว่า thinking mode ทั่วไป

Long Context ใช้ Temperature 0.7 ไม่ใช่ 1.0

นี่เป็นจุดที่ผมเข้าใจผิดมานาน — คิดว่า Reasoning = Temperature สูง เสมอไป

แต่เมื่อ context ใหญ่มาก (50K-120K tokens) โจทย์เปลี่ยน:

สิ่งที่ต้องการ	ค่า
Faithfulness ต่อ context	สำคัญที่สุด
Creativity	ไม่จำเป็น

ถ้า temperature สูงเกินไป บวกกับ context ยาว โมเดลมีแนวโน้มจะ:

หลุดจากข้อมูลต้นฉบับ
ข้ามรายละเอียดสำคัญ
สร้างสมมติฐานเพิ่มเอง

ดังนั้น long context analysis ทำงานได้ดีกว่าที่ temp 0.6-0.8 มากกว่า temp 1.0+

Note: สิ่งที่ผมเพิ่งสังเกต — Qwen เอง test Terminal-Bench (256K ctx) ด้วย temp=1.0 แต่ context ใหญ่เกินไป ใช้ temp ต่ำน่าจะเสถียรกว่า ยังต้องทดสอบเพิ่มเติม

Context Window 128K ควรตั้ง max_tokens เท่าไหร่

หลายคนเห็น Context Window = 128K แล้วตั้ง max_tokens = 128000 ทันที — ซึ่งโดยทั่วไปไม่จำเป็น

Qwen แนะนำ output length 32,768 tokens สำหรับ query ทั่วไป และ 81,920 tokens สำหรับปัญหาซับซ้อน เช่น math หรือ programming competitions แต่ใน production ตั้งค่าตามลักษณะงานจริง ๆ ดีกว่า

แนวทางที่ใช้:

Profile	max_tokens	เหตุผล
`qwen-chat`	4096	ตอบสั้น ไม่คิดยาว
`qwen-agent`	8192	tool calls + reasoning steps
`qwen-coder`	16384	code blocks + thinking
`qwen-reasoning`	16384	deep analysis
`qwen-longctx`	16384-32768	long doc summary

ประโยชน์ของการตั้ง max_tokens จริง ๆ:

ลด latency
ลด memory usage
ลด KV cache consumption
เพิ่ม throughput (โดยเฉพาะบน vLLM/SGLang)

Decision tree: เลือก profile ไหนดี

ถ้าไม่อยากสลับ profile บ่อย ๆ

สำหรับงานที่หลากหลาย (Coding + MCP + Agent + Infra + วิเคราะห์ระบบ) ผมเลือก middle-ground config ตัวเดียวที่ใช้ได้ดีในหลาย ๆ สถานการณ์ โดยไม่ต้องสลับ profile:

{
  "model": "qwen3.6-35b-base",
  "reasoning_effort": "medium",
  "max_tokens": 16384,
  "temperature": 0.7,
  "top_p": 0.9,
  "presence_penalty": 0.5,
  "repetition_penalty": 1.0,
  "extra_body": {
    "top_k": 20,
    "chat_template_kwargs": {
      "preserve_thinking": true
    }
  }
}

ทำไมค่านี้:

Param	Value	เหตุผล
`temperature: 0.7`	ระหว่าง 0.6 (coder) กับ 1.0 (reasoning)	ไม่เกิดค่าสุดโต่งเกินไปทั้งสองด้าน
`top_p: 0.9`	ต่ำกว่า reasoning (0.95)	ลดการสำรวจนิดหน่อย ให้อยู่ในกรอบ
`presence_penalty: 0.5`	ระหว่าง 0 (coder) กับ 1.5 (reasoning)	ไม่บังคับให้ใช้ identifier ซ้ำ แต่ลดการทำซ้ำ
`max_tokens: 16384`	ใหญ่พอ	รองรับทั้ง thinking + response
`preserve_thinking: true`	เปิด	เก็บ reasoning context ไว้

นี่ถือเป็น จุดกลาง ที่ใช้ได้ทั้ง Coding, Agent และการวิเคราะห์เอกสาร ไม่ต้องสลับ profile บ่อย ๆ

Note: ข้อแลกเปลี่ยน — ค่านี้จะไม่ optimize เต็มที่สำหรับ task ใด task หนึ่ง เช่น coding แทนที่จะเป็น qwen-coder (temp 0.6) ก็จะเป็นอย่างนี้แทน — เสีย precision นิดหน่อย แต่แลกกับความสะดวกที่ไม่ต้องสลับ profile

ผมใช้ config นี้เป็น default ใน LiteLLM alias qwen-default ของตัวเอง แล้วค่อย override เป็น profile เฉพาะเมื่อ task ต้องการ optimization จริง ๆ

ข้อสำคัญของ Temperature 0 กับ Backtest

ถ้าใช้กับ trading bot ที่ต้องการให้ผลลัพธ์ reproducible — ใช้ temperature: 0 ใน backtest mode ผมเคยคิดว่า temp 0.2 เพียงพอ แต่จริง ๆ แล้ว backtest ที่ต้อง reproduce ผลลัพธ์แบบ bit-for-bit ต้องใช้ temp 0 เท่านั้น

สรุป

ตารางสรุป 5 Virtual Profiles:

Profile	Temperature	Top-P	Presence Penalty	Max Tokens	Use Case
`qwen-reasoning`	1.0	0.95	1.5	16,384	Research, Architecture Review, Planning
`qwen-coder`	0.6	0.95	0.0	16,384	Python, TypeScript, SQL, Docker, MCP Server
`qwen-agent`	0.3	0.9	0.0	8,192	Function Calling, Tool Calling, Multi-Step
`qwen-chat`	0.7	0.8	1.5	4,096	Chatbot, FAQ, Support
`qwen-longctx`	0.7	0.9	0.5	16,384-32,768	Codebase Analysis, RAG, Log Analysis

แม้จะใช้ backend model ตัวเดียวกัน แต่การปรับ sampling ให้เหมาะกับลักษณะงาน จะช่วยให้โมเดลแสดงศักยภาพได้ดีกว่าการใช้ค่า default เดียวกับทุก use case

ในหลายระบบ agent/coding assistant ที่ผมทดลอง พบว่าการแยก profile แบบนี้ให้ผลลัพธ์ที่แตกต่างกันอย่างเห็นได้ชัด ทั้งด้านความเสถียร ความแม่นยำ และประสบการณ์การใช้งาน โดย แทบไม่ต้องเพิ่มต้นทุน infrastructure เลย เพราะยังใช้โมเดลตัวเดิม เปลี่ยนแค่วิธีการเรียก

สิ่งที่ยังไม่ได้ทดสอบ

ยังไม่ได้วัดว่า profile ไหนดีกว่าจริง ในงานจริง — เป็นแค่ hypothesis จาก Qwen's recommendation + use case theory
LiteLLM alias routing แต่ละ profile → ใช้ upstream config คนละชุด — ยังไม่ได้ทดสอบว่า performance impact เป็นยังไง
Trading bot (bot4k) ตอนนี้ใช้ temp 0/0.2 — ยังไม่ได้ลอง qwen-reasoning profile (temp 1.0) เพื่อดูว่า signal quality เปลี่ยนไหม

What's Next

ถ้าพี่ๆ ที่อ่านอยู่ใช้ LiteLLM อยู่ ลองสร้าง 2-3 profiles จาก 5 ที่ผมแนะนำ แล้วลองเทียบดู
ถ้าเจอ profile ที่ work ดีกว่านี้ มาแชร์กันได้ที่ comments
บทความนี้เป็น สรุปจาก draft notes + experiment ของผมเอง — ไม่ใช่ best practice สำเร็จรูป ใช้วิจารณ์ประกอบด้วย

References

Qwen3.6-35B-A3B Model Card (Hugging Face) - sampling recommendations, eval benchmarks
vLLM Documentation — OpenAI-compatible API server, --served-model-name alias support
LiteLLM Proxy Documentation — per-model config overrides via litellm_params
Qwen Blog - Sampling Best Practices — temperature/top-p guidance per mode
NVIDIA DGX Spark User Guide — DGX Spark deployment and tuning

litellm qwen dgx-spark llm sampling virtual-model use-case

แชร์บทความ

Facebook X

☕

เนื้อหานี้มีประโยชน์ไหม? ช่วยสนับสนุนค่ากาแฟให้ผู้เขียนสักแก้ว

Buy Me a Coffee

สารบัญ

TL;DR​

Insight: โมเดลเดียว แต่คนละ "คน"​

ทางออก: Virtual Models ผ่าน LiteLLM​

ข้อดี 3 ข้อ​

Sampling Parameters 101 (แบบสั้น)​

Temperature​

Top-P (Nucleus Sampling)​

Top-K​

Presence Penalty​

Repetition Penalty​

The 5 Profiles​

Profile 1: qwen-reasoning​

Profile 2: qwen-coder​

Profile 3: qwen-agent​

Profile 4: qwen-chat​

Profile 5: qwen-longctx​

Long Context ใช้ Temperature 0.7 ไม่ใช่ 1.0​

Context Window 128K ควรตั้ง max_tokens เท่าไหร่​

Decision tree: เลือก profile ไหนดี​

ถ้าไม่อยากสลับ profile บ่อย ๆ​

ข้อสำคัญของ Temperature 0 กับ Backtest​

สรุป​

สิ่งที่ยังไม่ได้ทดสอบ​

What's Next​

References​