Blog
All articles on AI engineering, LLM architecture, retrieval, and more.
When FP8 KV Cache Speeds Up Decode and When It Only Saves Memory
How FP8 KV cache affects real LLM serving latency: storage-only paths, fused dequantization, FP8 attention, decode break-even points, and calibration risk.
Why GQA Changes the KV Cache Bill Before Quantization
How grouped-query attention changes the KV cache formula before FP8 or INT4: MHA vs MQA vs GQA, decode bandwidth, and serving capacity math.
When Long Context Makes KV Cache Quantization Worth It: FP8, INT4, and Scale Budgets
How FP8 and INT4 KV cache quantization trade memory headroom for scale calibration, key/value asymmetry, residual windows, and long-context quality checks.
Why autonomous hacking agents dominate CTFs: search, exploit synthesis, and reward shaping
An advanced explanation of how search, exploit synthesis, reward shaping, and closed-loop interaction make autonomous hacking agents effective in CTF environments.
Why Enterprise Agent Platforms Are Converging on Supervisor Graphs Instead of Single Mega-Agents
A mechanism-level analysis of why enterprise agent systems are shifting toward supervisor graphs, specialist agents, and explicit control planes.
Why “Secure Local AI Computers” Change Agent Architecture: Capability Gating, Audit Trails, and Human-in-the-Loop Control Planes
A mechanism-level analysis of why secure local execution changes agent architecture, shifting trust from model outputs to runtime design and control planes.
When Chatbots Miscalibrate by User Type: What the MIT Study Really Shows
A mechanism-level reading of the MIT study, arguing that the deeper problem is group-conditional calibration failure rather than generic chatbot bias.
Anthropic Commits $100M to the Claude Partner Network
A source-based briefing on Anthropic’s $100M partner-network push and what it implies for AI ecosystem execution.
KV Cache Compression in Practice: FP8/INT4 Trade-offs, Paging, and Attention Accuracy Drift
A systems-level analysis of KV cache compression, paging behavior, and quality drift under FP8/INT4 serving regimes.
Claude Code’s Local Access Safety Mechanisms: Sandbox Modes, Command Controls, and Approval Gates
A technical breakdown of Claude Code permission modes, sandbox controls, approval gates, and operational risk boundaries.
GPT-5.4 Arrives: What Actually Changed for Builders
A source-based briefing on GPT-5.4 and adjacent Anthropic signals, focused on practical stack decisions for engineering teams.
Diagnosing Hallucinations with Attribution Traces and Retrieval Coverage Metrics
Build a trace-level evaluation stack that links wrong answers to missing context, weak reranking, or reasoning drift.
How Much Hardware Do You Really Need to Run OpenClaw?
A practical sizing guide for OpenClaw across laptops, Mac mini, and servers—from light automation to research and GPU-heavy workflows.
LoRA's Low-Rank Assumption: When It Holds, When It Breaks
An analysis of LoRA's low-rank hypothesis, approximation error bounds, diagnostics, and practical rank selection under distribution shift.
Why Does Chain-of-Thought Improve Model Inference Ability?
A formal analysis of how chain-of-thought prompting expands effective computation depth in transformers, with information-theoretic bounds and empirical evidence from reasoning benchmarks.
Why Is Vector Search So Fast? HNSW and IVF-PQ Explained With the Math
A walkthrough of approximate nearest neighbor search covering HNSW graphs, inverted file indexes, product quantization, and IVF-PQ with worked examples and memory analysis.
Why Tokenization Choices Quietly Shape Model Behavior
A technically rigorous comparison of BPE and Unigram tokenization with formalized algorithms, worked examples, and analysis of downstream effects on model behavior.
Attention in Practice: Visualizing Q/K/V and Why Scaling Heads Changes Behavior
A walkthrough of scaled dot-product attention (Q/K/V), softmax temperature, and why increasing head count shifts attention statistics and behavior.
What Is MCP and How Does It Work?
A practical breakdown of the Model Context Protocol architecture, transport modes, and why it fixes the N-times-M integration problem for AI tools.
