Weekly AI Buzz: Key Breakthroughs and Trends Shaping 2026
Dive into the latest AI developments from the past week, highlighting new models, innovative tools, prompting techniques, and emerging career paths.

TL;DR: NVIDIA's Nemotron 3 Super and Alibaba's Qwen 3.5 both activate roughly 10 billion parameters per token. In production, Nemotron 3 Super runs about 3x faster. Qwen 3.5 beats it by 16 points on SWE-bench Verified. The gap has almost nothing to do with model size — and a lot to do with architecture choices and software ecosystem maturity.
NVIDIA released Nemotron 3 Super on March 11, 2026 at GTC. Alibaba had released Qwen 3.5 in February. Both are open-weight, mixture-of-experts models that activate roughly 10 billion parameters per token. On a spec sheet, they occupy the same tier.
In production, one runs about 3x faster. The other repairs code 16 percentage points more accurately on SWE-bench Verified. They are not competing on the same axis — and picking the wrong one for your workload has real cost consequences.
Here's what the benchmarks actually say, why the throughput gap is smaller than NVIDIA's paper claims, and a concrete framework for choosing between them.
Nemotron 3 Super uses a three-way hybrid architecture: Mamba-2 SSM layers, standard Transformer attention layers, and a "LatentMoE" expert routing system. It is the first production model to interleave all three paradigms in a single forward pass. The SSM layers handle most sequence processing with O(n) linear complexity — vs the O(n²) of standard Transformer attention. At 64,000 output tokens, a typical length for a coding agent generating a large file, this difference compounds fast.
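The compounding effect of linear vs quadratic sequence mixing can be sketched numerically. This is an illustration of asymptotic scaling only, with hypothetical unit costs; real kernels differ by large constant factors.

```python
# Illustrative only: relative sequence-mixing cost for linear (O(n)) vs
# quadratic (O(n^2)) layers as context length grows. Unit costs are
# hypothetical, not measured kernel timings.

def quadratic_cost(n: int) -> int:
    """Pairwise token interactions, as in standard Transformer attention."""
    return n * n

def linear_cost(n: int) -> int:
    """One recurrent state update per token, as in SSM-style layers."""
    return n

for n in (8_000, 64_000):
    ratio = quadratic_cost(n) // linear_cost(n)
    print(f"n={n:>6}: quadratic/linear cost ratio = {ratio:,}x")
```

At 8,000 tokens the quadratic term is 8,000x the linear one; at 64,000 output tokens it is 64,000x, which is why long-output agentic workloads are where the SSM layers pay off.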
Two factors amplify the speed advantage further. Multi-Token Prediction (MTP) — a form of built-in speculative decoding — generates multiple tokens per forward pass, adding roughly 50% to raw generation speed. NVIDIA also trained the model in NVFP4 precision natively. On Blackwell B200 hardware, NVFP4 runs 4x faster than FP8 on H100. Stack all three advantages and you get NVIDIA's 7.5x benchmark figure. It's real — under those specific conditions.
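Because the three advantages stack multiplicatively, you can back out the factor implied for the SSM layers from the headline number. The decomposition below is an assumption for illustration; NVIDIA has not published a per-factor breakdown.

```python
# Hypothetical decomposition of the 7.5x benchmark figure into the three
# stacked advantages described above. The MTP and precision factors come
# from the text; the SSM factor is inferred, not published.
mtp_speedup = 1.5        # Multi-Token Prediction: ~50% over raw decoding
precision_speedup = 4.0  # NVFP4 on B200 vs FP8 on H100
# Residual factor attributable to the linear-time SSM layers at 64k
# output tokens, chosen so the product matches the headline claim:
ssm_speedup = 7.5 / (mtp_speedup * precision_speedup)
print(f"implied SSM-layer factor: {ssm_speedup:.2f}x")
print(f"stacked total: {mtp_speedup * precision_speedup * ssm_speedup:.1f}x")
```

The point of the arithmetic: most of the 7.5x comes from MTP and NVFP4-on-Blackwell, which is why the gap shrinks on other hardware or shorter outputs.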
Qwen 3.5's architecture is also theoretically efficient. It interleaves Gated DeltaNet linear attention layers with standard Transformer attention in a 3:1 ratio, plus 256 routed MoE experts per layer. Gated DeltaNet is linear-time in theory — but as of early 2026, it has no ONNX operator support. Deployments running outside native PyTorch on CUDA decompose each recurrent step into 15–20 primitive operations instead of a single fused kernel. In those configurations, throughput on the DeltaNet layers degrades by 10–50x. This is a software ecosystem gap, not a fundamental architectural weakness — but it's real today, and it's the primary structural reason Nemotron 3 Super is faster in mainstream deployments.
Here's a direct comparison across the benchmarks most relevant to production deployment decisions.
| Benchmark | Nemotron 3 Super (120B) | Qwen3.5-122B | Qwen3.5-397B |
|---|---|---|---|
| SWE-bench Verified | 60.47% | ~72–74% (not independently confirmed) | 76.4% |
| GPQA (with tools) | 82.70% | — | 88.4% |
| MMLU-Pro | 83.73% | — | 86%+ |
| HLE (Humanity's Last Exam) | 18.26% | — | 25.30% |
| RULER at 1M context | 91.75% | — | — |
| PinchBench (agentic orchestration) | 85.6% (top open model) | — | — |
| Throughput (real-world API) | ~458–484 tok/s | ~152 tok/s | — |
The SWE-bench gap is the most operationally significant number here. A 16-point difference on coding repair tasks translates to a meaningfully higher per-step failure rate. If you're running high-volume agentic pipelines where retries are cheap and fast, that failure rate becomes a cost model question. If you're running single-shot high-stakes evaluation — production code review, security triage — the accuracy gap matters more than the throughput advantage.
Where Nemotron 3 Super leads clearly: long-context performance and agentic orchestration. A 91.75% RULER score at 1 million tokens and an 85.6% PinchBench score represent the strongest results in the open-weight category for multi-step agentic workflows as of March 2026. The 1 million token native context window, built on Mamba-2's linear complexity, is a genuine hardware-efficient advantage for workflows that need it.
The HLE gap is worth flagging separately. Nemotron 3 Super scores 18.26% vs Qwen3.5-397B's 25.30% on Humanity's Last Exam, which tests general scientific breadth across domains. Denser architectures with broader training coverage tend to perform better here — consistent with the pattern across most general reasoning benchmarks.
NVIDIA's technical report claims 7.5x higher throughput than Qwen3.5-122B. The test configuration: 8,000 input tokens and 64,000 output tokens, on NVIDIA hardware running NVFP4 precision. Under those conditions — long outputs, NVIDIA Blackwell stack, native precision format — the advantage is real.
Third-party data from Artificial Analysis puts the production API gap at roughly 3x: Nemotron 3 Super delivers 458–484 tokens per second; Qwen3.5-122B delivers around 152 tokens per second on Alibaba's API. Still a significant margin — but the 7.5x figure represents peak optimized conditions. Most workloads with shorter output sequences will see something closer to 3x.
One deployment constraint worth noting on the Nemotron side: Mamba-2 SSM kernels also have limited third-party framework support in early 2026. ONNX operators for SSM layers don't exist yet. If your inference stack runs outside NVIDIA NIM containers or native PyTorch on CUDA, you lose the SSM kernel advantage — and you're left with the same software ecosystem friction that affects Qwen's DeltaNet layers. Both models have an 8x H100-80GB self-hosting floor. Neither is accessible on small GPU setups.
Qwen 3.5 is Apache 2.0 across all model sizes. You can modify it, deploy it commercially, build derivative models, and redistribute without attribution requirements.
Nemotron 3 Super uses the NVIDIA Nemotron Open Model License. It's commercially usable and royalty-free, but it carries attribution requirements and NVIDIA-specific safeguard clauses. Calling it "open-source" without qualification is inaccurate. For enterprise teams with legal review processes — particularly those building products that include model redistribution or derivative model training — this difference is non-trivial. Verify with your legal team before committing it to a production product.
The core question is whether you're optimizing for throughput in a high-volume agentic pipeline or for accuracy on individual high-stakes tasks. Here's a practical checklist.
Use Nemotron 3 Super when:
- You run high-volume agentic pipelines where throughput and cost-per-step dominate and retries are cheap.
- Your workloads need very long context: the 1 million token native window and 91.75% RULER score are class-leading.
- You're already on NVIDIA-native infrastructure (Blackwell, NIM containers, or native PyTorch on CUDA).
- Multi-step agentic orchestration is the core workload: 85.6% on PinchBench is the top open-model result.
Use Qwen 3.5 when:
- Per-step accuracy is the constraint: 76.4% on SWE-bench Verified (397B) vs 60.47%, which matters for single-shot, high-stakes work like code review or security triage.
- You need image inputs: Qwen 3.5 is natively multimodal; Nemotron 3 Super is text-only.
- Licensing matters: Apache 2.0 permits commercial use, redistribution, and derivative models without attribution requirements.
- You can run native PyTorch on CUDA, avoiding the DeltaNet kernel penalty.
One nuance worth modeling before committing: in a pipeline where each agent step has a failure mode and you're running 1,000 steps per hour, Nemotron 3 Super's lower SWE-bench score means more retries. If each retry costs roughly the same as an additional inference step, the throughput advantage partially erodes. Calculate cost-per-successful-outcome for your actual workload, not just tokens-per-second.
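The retry calculation above can be made concrete with a minimal cost model, assuming independent per-step failures and that each retry costs one extra inference step. The per-step costs below are illustrative, not measured prices.

```python
# A minimal "cost-per-successful-outcome" sketch. Assumes independent
# per-step failures, so expected attempts per success = 1 / success_rate
# (geometric distribution). All numbers are illustrative.

def cost_per_success(cost_per_step: float, success_rate: float) -> float:
    """Expected cost to obtain one successful step, retrying on failure."""
    return cost_per_step / success_rate

# Hypothetical: the faster model costs a third as much per step (roughly
# the 3x production throughput gap), but succeeds less often per step
# (SWE-bench-style repair rates from the table above).
fast = cost_per_success(cost_per_step=1.0, success_rate=0.60)
accurate = cost_per_success(cost_per_step=3.0, success_rate=0.764)
print(f"faster model:   {fast:.2f} cost units per success")
print(f"accurate model: {accurate:.2f} cost units per success")
```

Under these assumptions the faster model still wins on cost-per-success (about 1.67 vs 3.93 cost units), but the 3x raw throughput advantage shrinks to roughly 2.4x once retries are priced in. Rerun the arithmetic with your own per-step costs and measured success rates.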
Nemotron 3 Super is a poor fit when:
- You need image or multimodal inputs; it is text-only as of March 2026.
- Your inference stack runs outside NVIDIA NIM containers or native PyTorch on CUDA, where the SSM kernel advantage disappears.
- Your product redistributes the model or trains derivatives and the NVIDIA Nemotron Open Model License terms don't clear legal review.
Qwen 3.5 is a poor fit when:
- You deploy through ONNX or other non-PyTorch runtimes, where the DeltaNet layers degrade by 10–50x.
- Throughput is the binding constraint: it delivers roughly a third of Nemotron 3 Super's production API speed.
Both require a minimum of 8x H100-80GB to self-host. Nemotron 3 Super also supports NVIDIA NIM containers with FP8 weights on H100. Neither runs on small GPU setups — these are enterprise-class deployments.
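Rough memory arithmetic shows why the 8x H100-80GB floor exists. The sketch below counts weights only, assuming FP8 (1 byte per parameter); in practice the remaining VRAM is consumed by KV cache and recurrent state, which dominate at long context.

```python
# Rough self-hosting memory arithmetic. Assumes FP8 weights (1 byte per
# parameter) and ignores KV cache, activations, and framework overhead,
# which consume most of the remaining headroom at long context.
params_billion = 120        # ~120B total parameters (both models' tier)
bytes_per_param = 1         # FP8 precision
weights_gb = params_billion * bytes_per_param   # GB for weights alone
cluster_gb = 8 * 80         # the 8x H100-80GB floor from the text
print(f"weights: ~{weights_gb} GB; cluster VRAM: {cluster_gb} GB")
```

Weights alone need roughly 120 GB, so a single 80 GB GPU cannot hold the model even before serving state is counted; the 640 GB cluster leaves headroom for batching and long-context cache.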
Is either model multimodal? No. Nemotron 3 Super is text-only as of March 2026. Qwen 3.5 supports image inputs natively across multiple model sizes in the family, including the 122B variant.
Is NVIDIA's 7.5x throughput claim accurate? It's accurate under specific conditions: 64,000+ output tokens on Blackwell hardware under NVFP4 precision. Third-party API data puts the production gap at 3x–4x for typical workloads with standard output lengths. Significant, but not 7.5x in standard conditions.
How do the context windows compare? Qwen 3.5 has a 256K native context window, extensible to roughly 1M. Nemotron 3 Super's 1M context is native and uses Mamba-2's linear complexity, making it more compute-efficient at very long contexts.
Which model is better for coding? Qwen 3.5 scores higher on SWE-bench Verified — 76.4% vs 60.47%. For accuracy-first single-shot workflows, Qwen 3.5 is the stronger choice. For high-volume agents where inference cost per step is the constraint, Nemotron 3 Super may reduce overall cost depending on your retry tolerance and hardware stack.
Neither model is the clear winner. NVIDIA built Nemotron 3 Super to demonstrate what their hardware stack enables at scale — the throughput advantage is real on their infrastructure, and it's operationally meaningful for high-volume agentic workflows. Alibaba built Qwen 3.5 to run everywhere, with the strongest SWE-bench score at its parameter class, true Apache 2.0 licensing, and hardware-agnostic architecture.
The practical decision: if you're on NVIDIA-native infrastructure and running throughput-critical agentic pipelines where per-step accuracy in the 60% range is acceptable, Nemotron 3 Super will likely lower your inference cost. In most other cases, Qwen 3.5 is the stronger default. NVIDIA's technical blog covers the full Mamba-2 and MTP architecture design for teams that want to go deeper before committing.
Before committing to either, benchmark both on a sample of your actual workload. Measure cost-per-successful-outcome — not tokens-per-second. That number will tell you which model actually fits your pipeline.