Qwen3.6-35B-A3B: Choosing parameters by use case

June 16, 2026 · 6 min read


Table of contents

Context
TL;DR
Default server config on DGX (current)
Qwen's own eval notes (from the HF model card)
Best Practices from the HF model card
Per-use-case recommendations
1. Coding agent
2. General agent (multi-step tool use)
3. STEM / Reasoning
4. Knowledge Q&A
Decision tree
Planned settings
Caveats
How to override in the client
Conclusion
References

Context

Qwen3.6-35B-A3B is running on a DGX Spark (port 8001, vLLM v0.23.0, 128K context).

Until now I have used temperature 0.6, following the recipe's --override-generation-config. A close read of the HF model card, however, shows that the Qwen team itself uses different values for each benchmark category.

So I collected them into a short note for reference.

TL;DR

Qwen3.6-35B-A3B has different sampling parameters in its "eval notes" (used to report benchmarks) and its "Best Practices" (for real use), especially for temperature: the eval notes use temp=1.0 for agentic coding (SWE-bench, Terminal-Bench), while Best Practices recommends temp=0.6 for precise coding (WebDev) and temp=1.0 for general thinking tasks. This post summarizes both views and lays out a plan for applying them on DGX Spark (vLLM 0.23.0, 128K context).

Default server config on DGX (current)

max_model_len: 131072
temperature: 0.6
top_p: 0.95
top_k: 20
min_p: 0.0
enable_thinking: true        # from default-chat-template-kwargs
preserve_thinking: true
reasoning_parser: qwen3
speculative: mtp k=2

Note: maxOutputTokens is a client-side parameter (max_tokens in the API request); vLLM has no server default for it. In addition, presence_penalty and repetition_penalty are not set in the current server config, although Qwen's Best Practices recommends them (see below).
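To make the layering concrete, here is a minimal TypeScript sketch of how per-request values sit on top of the server defaults listed above. The `SamplingParams` type and the `effectiveParams` helper are hypothetical names used only for illustration; they model the precedence and are not part of vLLM.

```typescript
// Hypothetical illustration: models how per-request sampling values
// take precedence over the server defaults listed above.
interface SamplingParams {
  temperature: number;
  top_p: number;
  top_k: number;
  min_p: number;
  presence_penalty?: number;   // not set in the current server config
  repetition_penalty?: number; // not set in the current server config
  max_tokens?: number;         // client-side only, no server default
}

// Values copied from the server config above.
const SERVER_DEFAULTS: SamplingParams = {
  temperature: 0.6,
  top_p: 0.95,
  top_k: 20,
  min_p: 0.0,
};

// Any field present in the request wins; everything else falls back to the default.
function effectiveParams(overrides: Partial<SamplingParams>): SamplingParams {
  return { ...SERVER_DEFAULTS, ...overrides };
}

// A general thinking task: raise temperature and add the penalty the server does not set.
const general = effectiveParams({
  temperature: 1.0,
  presence_penalty: 1.5,
  max_tokens: 32768,
});
```

Here `general` keeps `top_p`, `top_k` and `min_p` from the server config while carrying the request's own `temperature`, `presence_penalty` and `max_tokens`.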

Qwen's own eval notes (from the HF model card)

In its evaluation notes, the Qwen team lists the sampling settings used for each benchmark:

Benchmark	temperature	top_p	top_k	max_tokens	Context
SWE-bench series	1.0	0.95	-	-	200K
Terminal-Bench 2.0	1.0	0.95	20	80K	256K
QwenClawBench (general agent)	0.6	-	-	-	256K
NL2Repo (via Claude Code)	1.0	0.95	-	-	max_turns=900

Pattern visible in the eval notes:

Agentic coding (SWE-bench, Terminal-Bench) → temperature 1.0
General agent (QwenClawBench) → temperature 0.6

Important: the eval notes are the settings used to report benchmarks, not recommendations for general use. See Best Practices below for the values that are actually recommended.

Best Practices from the HF model card

The Qwen team splits its recommendations by mode and task type:

Mode/Task	temperature	top_p	top_k	presence_penalty	repetition_penalty
Thinking mode: general tasks	1.0	0.95	20	1.5	1.0
Thinking mode: precise coding (e.g. WebDev)	0.6	0.95	20	0.0	1.0
Instruct / non-thinking mode	0.7	0.80	20	1.5	1.0

Note: presence_penalty can be adjusted between 0 and 2 to reduce repetition, but higher values may cause language mixing and a slight drop in performance.

Recommended output length:

General: 32,768 tokens
Complex problems (math, programming competitions): 81,920 tokens
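The Best Practices table and the output-length guidance above can be captured as a small lookup. This is a sketch only: the `QwenMode` keys, `BEST_PRACTICES` and `recommendedMaxTokens` are hypothetical names, and the numbers are copied from the table above.

```typescript
// Hypothetical names; values copied from the Best Practices table above.
type QwenMode = "thinking-general" | "thinking-precise-coding" | "non-thinking";

interface Preset {
  temperature: number;
  top_p: number;
  top_k: number;
  presence_penalty: number;
  repetition_penalty: number;
}

const BEST_PRACTICES: Record<QwenMode, Preset> = {
  "thinking-general": {
    temperature: 1.0, top_p: 0.95, top_k: 20, presence_penalty: 1.5, repetition_penalty: 1.0,
  },
  "thinking-precise-coding": {
    temperature: 0.6, top_p: 0.95, top_k: 20, presence_penalty: 0.0, repetition_penalty: 1.0,
  },
  "non-thinking": {
    temperature: 0.7, top_p: 0.8, top_k: 20, presence_penalty: 1.5, repetition_penalty: 1.0,
  },
};

// Recommended output length: 32,768 tokens in general,
// 81,920 tokens for complex math or competition-style problems.
function recommendedMaxTokens(complexProblem: boolean): number {
  return complexProblem ? 81920 : 32768;
}

// Example: a WebDev-style precise coding request with the general output budget.
const webdev = {
  ...BEST_PRACTICES["thinking-precise-coding"],
  max_tokens: recommendedMaxTokens(false),
};
```

Keeping the presets in one typed record means a new mode has to supply every field, which makes it harder to forget `presence_penalty` when switching tasks.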

Per-use-case recommendations

1. Coding agent

Param	Value	Note
`temperature`	1.0	agentic coding (multi-step)
`top_p`	0.95	default
`presence_penalty`	1.5	reduces repetition in long sessions
`max_tokens`	16K-80K	depends on the task
`enable_thinking`	true	reasoning helps code quality
Context	128K+	needed for repo-level tasks

Benchmarks: SWE-bench Verified 73.4, SWE-bench Pro 49.5, Terminal-Bench 51.5, Claw-Eval 68.7 (highest in class)

For precise coding (e.g. single-shot WebDev), Best Practices recommends temperature=0.6, presence_penalty=0.0.

2. General agent (multi-step tool use)

Param	Value	Note
`temperature`	0.6	per the QwenClawBench eval
`top_p`	0.95	default
`presence_penalty`	1.5	reduces repetition
`max_tokens`	4K-8K	enough for a tool trace
`enable_thinking`	true	helps with planning
Context	128K

Benchmarks: Widesearch 60.1, MCPMark 37.0, Tool Decathlon 26.9, QwenClawBench 52.6

3. STEM / Reasoning

Param	Value	Note
`temperature`	1.0	per Best Practices (general thinking)
`top_p`	0.95
`presence_penalty`	1.5
`max_tokens`	8K-80K	8K for simple problems, 80K (81,920) for complex ones
`enable_thinking`	true	required
Context	128K

Benchmarks: AIME26 92.7, GPQA Diamond 86.0, HMMT Feb 26 83.6 (highest in class)

4. Knowledge Q&A

Param	Value	Note
`temperature`	0.6-1.0	go lower when precise answers matter
`top_p`	0.95
`presence_penalty`	1.5
`max_tokens`	2K-32K	2-4K in general, 32K for deep analysis
`enable_thinking`	true
Context	32K	enough in most cases

Benchmarks: MMLU-Pro 85.2, C-Eval 90.0, SuperGPQA 64.7

Decision tree
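
One way to express the choice across the four use cases above as code is shown below. It is a minimal sketch derived from the per-use-case tables, not a definitive rule: `chooseParams`, `UseCase` and the option names are hypothetical, "K" is assumed to mean 1,024 tokens, and reusing the 16K-80K range for single-shot coding is an assumption, since the tables give no separate range for it.

```typescript
// Hypothetical helper: condenses the four per-use-case tables into one lookup.
const K = 1024; // assumption: "K" in the tables means 1,024 tokens (80K = 81,920)

type UseCase = "coding-agent" | "general-agent" | "stem-reasoning" | "knowledge-qa";

interface Options {
  singleShot?: boolean;      // precise, single-shot coding such as WebDev
  preferPrecision?: boolean; // knowledge Q&A where accuracy matters more than variety
}

interface Choice {
  temperature: number;
  presence_penalty: number;
  maxTokens: [number, number]; // [low, high] range from the tables
  context: number;             // context length from the tables
}

function chooseParams(useCase: UseCase, opts: Options = {}): Choice {
  const table: Record<UseCase, Choice> = {
    // Agentic coding uses 1.0; single-shot precise coding drops to 0.6 with no penalty.
    // Assumption: the same max_tokens range applies to both coding variants.
    "coding-agent": opts.singleShot
      ? { temperature: 0.6, presence_penalty: 0.0, maxTokens: [16 * K, 80 * K], context: 128 * K }
      : { temperature: 1.0, presence_penalty: 1.5, maxTokens: [16 * K, 80 * K], context: 128 * K },
    "general-agent":
      { temperature: 0.6, presence_penalty: 1.5, maxTokens: [4 * K, 8 * K], context: 128 * K },
    "stem-reasoning":
      { temperature: 1.0, presence_penalty: 1.5, maxTokens: [8 * K, 80 * K], context: 128 * K },
    // The table gives a 0.6-1.0 range; pick the low end when precision matters.
    "knowledge-qa":
      { temperature: opts.preferPrecision ? 0.6 : 1.0, presence_penalty: 1.5,
        maxTokens: [2 * K, 32 * K], context: 32 * K },
  };
  return table[useCase];
}

const agentic = chooseParams("coding-agent");
const factual = chooseParams("knowledge-qa", { preferPrecision: true });
```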

Planned settings

Workload	temperature	max_tokens	thinking	Rationale
bot4k backtest	0 (reproducibility)	3000	true	precise, deterministic logic
bot4k live trading	0.2	3000	true	slight randomness
Hermes coding	1.0	8000-16000	true	follows Qwen's agentic coding eval setting
Long doc analysis	0.6	4000	true	128K context is enough
Math/reasoning task	1.0	8000	true	AIME26 92.7

Caveats

Context length: the HF card states explicitly, "Maintain at least 128K tokens to preserve thinking capabilities", so our 128K recipe just meets that floor.
128K vs native 262K: the model supports 262K, but we run 128K (max_model_len in the recipe) because of memory constraints on a single Spark.
max_num_seqs drop: the 128K recipe lowers it from 40 to 12 because the KV cache is larger.
MTP k=2: tested stable on short prompts (58-64 tok/s), but on long prompts (16K+) it may behave differently from k=1.
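
The 128K ceiling also limits how large max_tokens can be on a given request, because the prompt and the generated tokens (thinking included) share the same max_model_len window, and vLLM generally rejects a request whose prompt plus max_tokens exceeds it. A minimal sketch of that arithmetic follows; `maxTokensBudget` is a hypothetical helper, and the prompt token count is assumed to come from your own tokenizer.

```typescript
// max_model_len from the server config above.
const MAX_MODEL_LEN = 131072;

// Hypothetical helper: clamp the requested output budget so that
// prompt tokens + max_tokens never exceeds max_model_len.
function maxTokensBudget(promptTokens: number, requestedMaxTokens: number): number {
  const remaining = MAX_MODEL_LEN - promptTokens;
  if (remaining <= 0) {
    throw new RangeError(
      `prompt uses ${promptTokens} tokens, which leaves no room within ${MAX_MODEL_LEN}`,
    );
  }
  return Math.min(requestedMaxTokens, remaining);
}

// A 100,000-token prompt leaves 131,072 - 100,000 = 31,072 tokens,
// so a requested 81,920-token budget is clamped to 31,072.
const clamped = maxTokensBudget(100000, 81920);

// A 16,000-token prompt leaves plenty of room, so 32,768 is kept as requested.
const unclamped = maxTokensBudget(16000, 32768);
```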

How to override in the client

The vLLM server defaults to temperature 0.6, but a client can override it per request.

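import { generateText } from 'ai';

// `model` is assumed to be defined elsewhere and to point at the vLLM server.
const result = await generateText({
  model,
  temperature: 1.0,           // override the server default: 0.6 → 1.0
  maxOutputTokens: 8000,      // depends on the task
  providerOptions: {
    openai: {
      reasoningEffort: 'medium', // used together with temperature
    },
  },
  // ...
});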

reasoning_effort is a field in the vLLM 0.23.0 OpenAI-compatible API. It is a different knob from the Qwen eval notes, which rely mainly on temperature, but the two can be used together.
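
For clients that talk to the server without an SDK, the same override can be written as a raw chat-completions request body. The sketch below only builds the JSON payload. `buildChatRequest` and the `"served-model-name"` placeholder are hypothetical; `top_k` and `chat_template_kwargs` are extra fields that vLLM's OpenAI-compatible server accepts on top of the standard OpenAI schema.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface VllmChatRequest {
  model: string;
  messages: ChatMessage[];
  temperature: number;
  top_p: number;
  presence_penalty: number;
  max_tokens: number;
  top_k: number;                                   // vLLM extension
  chat_template_kwargs: { enable_thinking: boolean }; // vLLM extension
}

// Hypothetical helper: builds a per-request override for a general thinking task.
function buildChatRequest(modelName: string, prompt: string): VllmChatRequest {
  return {
    model: modelName,
    messages: [{ role: "user", content: prompt }],
    temperature: 1.0,       // overrides the server default of 0.6
    top_p: 0.95,
    presence_penalty: 1.5,  // not set on the server, so it is sent explicitly
    max_tokens: 8000,       // client-side only, depends on the task
    top_k: 20,
    chat_template_kwargs: { enable_thinking: true },
  };
}

// "served-model-name" is a placeholder; use whatever name the server exposes.
const body = JSON.stringify(buildChatRequest("served-model-name", "Explain KV cache sizing."));
```

The resulting `body` can be sent as a POST with `Content-Type: application/json` to the server's `/v1/chat/completions` endpoint on port 8001.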

Conclusion

Qwen3.6-35B-A3B has no single setting that works for everything. Instead, it has eval notes for benchmarks and Best Practices for real use, and the two differ. Points to remember:

Agentic coding (SWE-bench, Terminal-Bench): the evals use temp=1.0, but for single-shot precise coding drop to 0.6
General thinking tasks (STEM, reasoning): Best Practices recommends temp=1.0 + presence_penalty=1.5
presence_penalty matters: it is not set in the current server config, but Best Practices recommends 1.5 for general use and 0.0 for precise coding
Output length: 32K (32,768) for general use, 80K (81,920) for complex problems

Overriding per request from the client is the most flexible approach, since the server does not need a restart every time the workload changes.

References

HF model card — evaluation notes + Best Practices
Qwen3.6 blog
DGX config: ~/notes/llm-inference/vllm-config-best-value.md
vLLM 0.23.0 upgrade notes: ~/notes/llm-inference/vllm-current-status.txt
