<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Popcorn Movies and TV: Om Shree</title>
    <description>The latest articles on Popcorn Movies and TV by Om Shree (@om_shree_0709).</description>
    <link>https://popcorn.forem.com/om_shree_0709</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2900392%2F78ad1723-16ab-4e46-b39c-7f3feb416d23.jpg</url>
      <title>Popcorn Movies and TV: Om Shree</title>
      <link>https://popcorn.forem.com/om_shree_0709</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://popcorn.forem.com/feed/om_shree_0709"/>
    <language>en</language>
    <item>
      <title>GPT-5.5 vs Claude Opus 4.7 vs Gemini 3.1 Pro: The Frontier Model Showdown</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Sat, 25 Apr 2026 03:38:59 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/gpt-55-vs-claude-opus-47-vs-gemini-31-pro-the-frontier-model-showdown-4mji</link>
      <guid>https://popcorn.forem.com/om_shree_0709/gpt-55-vs-claude-opus-47-vs-gemini-31-pro-the-frontier-model-showdown-4mji</guid>
      <description>&lt;p&gt;Three flagship models. Three different labs. Three different bets on what production AI actually needs in 2026. GPT-5.5 dropped April 23, Opus 4.7 dropped April 16, and Gemini 3.1 Pro has been in developer preview since February 19. If you're building agents, coding tools, or any serious production workflow right now, you need to know exactly where each one wins — and where it doesn't.&lt;/p&gt;

&lt;p&gt;This is the breakdown with no hedging.&lt;/p&gt;




&lt;h2&gt;The Problem With "Best Model" Claims&lt;/h2&gt;

&lt;p&gt;Every lab calls its flagship the best. The honest answer is that no single model wins across every workload in April 2026. The differentiation has shifted from raw intelligence to specificity: which model is best for &lt;em&gt;your&lt;/em&gt; tasks, at &lt;em&gt;your&lt;/em&gt; price point, on &lt;em&gt;your&lt;/em&gt; infrastructure. The gap between these three models on most benchmarks is narrow enough that the wrong choice costs more in API spend and rework than the right choice saves in capability.&lt;/p&gt;

&lt;p&gt;Here's how to actually read the comparison.&lt;/p&gt;




&lt;h2&gt;The Benchmark Map: Who Wins What&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic coding&lt;/strong&gt; is the highest-stakes category right now, and the results are split.&lt;/p&gt;

&lt;p&gt;On Terminal-Bench 2.0, GPT-5.5 achieves 82.7%, up from GPT-5.4's 75.1%, while Claude Opus 4.7 sits at 69.4%. (Gemini 3.1 Pro's closest published figure, 54.2%, is on SWE-Bench Pro rather than Terminal-Bench, so it isn't directly comparable here.) GPT-5.5 wins Terminal-Bench decisively. This benchmark tests real command-line workflows: shell scripting, container orchestration, and tool chaining. If your agent lives in a terminal, this is the number that matters most.&lt;/p&gt;

&lt;p&gt;But on &lt;strong&gt;SWE-Bench Pro&lt;/strong&gt; — real GitHub issue resolution across Python, JavaScript, Java, and Go — the rankings flip. Opus 4.7 scores 64.3% on SWE-Bench Pro, leapfrogging both GPT-5.4 at 57.7% and Gemini at 54.2%. GPT-5.5's score of 58.6% puts it ahead of GPT-5.4 but still behind Opus 4.7 on this specific benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool use and MCP&lt;/strong&gt; is Opus 4.7's clearest win. Opus 4.7 leads MCP-Atlas at 77.3%, ahead of GPT-5.4 at 68.1% and Gemini 3.1 Pro at 73.9%. MCP-Atlas measures complex, multi-turn tool-calling scenarios — the closest thing to a real production agent benchmark. For teams building orchestration agents that route across multiple tools in a single workflow, this result is the one to pay attention to.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scientific reasoning (GPQA Diamond)&lt;/strong&gt; is essentially a three-way tie. Opus 4.7 comes in at 94.2%, Gemini 3.1 Pro at 94.3%, and GPT-5.4 Pro at 94.4%. GPT-5.5 does not break this tie meaningfully. This benchmark is approaching saturation at the frontier — the differentiation is elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Abstract reasoning (ARC-AGI-2)&lt;/strong&gt; is Google's headline story. Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, more than double Gemini 3 Pro's score of 31.1%. ARC-AGI-2 specifically tests novel pattern recognition that models cannot have memorized during training. Neither OpenAI nor Anthropic has published comparable scores here, which tells its own story.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer use&lt;/strong&gt; is close, but GPT-5.5 nudges ahead. GPT-5.5 achieves 78.7% on OSWorld-Verified against Opus 4.7's 78.0%; both beat GPT-5.4's 75.0%. Opus 4.7's three-point lead over the previous OpenAI generation has flipped into a 0.7-point deficit against the new one. Marginal, but a reversal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web search and browsing&lt;/strong&gt; is GPT-5.5's other clear advantage. GPT-5.4 held a BrowseComp lead at 89.3% versus Opus 4.7's 79.3%. GPT-5.5 maintains this gap. If your agent needs to navigate the web reliably, OpenAI has the edge.&lt;/p&gt;




&lt;h2&gt;How Each Model Actually Works Differently&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; is a genuinely new foundation. It's the first fully retrained base model since GPT-4.5 — not a refinement of the GPT-5 architecture, but a model trained from scratch. That explains the Terminal-Bench jump. The model reasons about code execution differently at a fundamental level, not just incrementally better. It matches GPT-5.4's per-token latency while performing at a higher intelligence level — and uses fewer tokens to complete the same Codex tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; introduced a behavioral shift that the benchmarks only partially capture. It devises ways to verify its own outputs before reporting back, catches its own logical faults during the planning phase, and accelerates execution far beyond previous Claude models. This isn't just a score improvement — it's a change in how the model approaches long-horizon agentic work. Low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6, which means the efficiency gain shows up in your token bill before you even tune effort levels. The vision upgrade also deserves mention: image resolution jumped from 1.15 megapixels to 3.75 megapixels — more than three times the pixel count of any prior Claude model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; plays a different game: multimodal breadth and context scale. It is the only frontier model with true native multimodal support — handling text, images, audio, and video simultaneously within a single unified model. GPT-5.5 handles text and images but not audio or video at the API level. Opus 4.7 has excellent vision but no audio or video. The context window is 2 million tokens — the largest of any frontier model available today. In practical terms, this means processing entire book collections, extensive legal contracts, or hours of video in a single prompt. GPT-5.5 and Opus 4.7 both top out around 1 million tokens; Gemini doubles that.&lt;/p&gt;




&lt;h2&gt;What Developers Are Actually Using Each One For&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5 in Codex&lt;/strong&gt; is the default choice for infrastructure automation, CI/CD scripting, and multi-step computer use. The Terminal-Bench lead is real and it matters for DevOps-adjacent workflows. Cursor co-founder Michael Truell confirmed GPT-5.5 stayed on task longer and showed more reliable tool use than GPT-5.4. It's also the model to choose if your agent does significant web navigation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; is the strongest choice for production coding agents that need to reason through ambiguous, multi-file engineering problems — and for any workflow that requires reliable tool orchestration. Vercel confirmed Opus 4.7 does proofs on systems code before starting work — a new behavior not seen in prior Claude models. For legal tech, financial analysis, and document-heavy enterprise work, the Finance Agent benchmark win (64.4%, state-of-the-art at release) and the BigLaw Bench result (90.9%) are concrete signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; is the right choice when your workload is research-heavy, multimodal by nature, or involves very long context that would push the other models to their limits. It's also the only model in this group that can natively process video alongside text — useful for content pipelines, educational tooling, and media analysis.&lt;/p&gt;




&lt;h2&gt;The Pricing Table That Actually Matters&lt;/h2&gt;

&lt;p&gt;This is where the decision often gets made.&lt;/p&gt;

&lt;p&gt;Gemini 3.1 Pro costs $2.00 per million input tokens and $12.00 per million output tokens.&lt;/p&gt;

&lt;p&gt;Claude Opus 4.7 is priced at $5 per million input tokens and $25 per million output tokens — unchanged from Opus 4.6.&lt;/p&gt;

&lt;p&gt;GPT-5.5 costs $5.00 per million input tokens and $30.00 per million output tokens.&lt;/p&gt;

&lt;p&gt;On input tokens, Gemini 3.1 Pro costs 60% less than either of the other two flagships ($2 versus $5 per million). At 10 million output tokens per month, Gemini comes in at roughly $120, Opus 4.7 at $250, and GPT-5.5 at $300. For high-volume workloads where Gemini's benchmark profile is sufficient, that gap is real budget.&lt;/p&gt;
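
&lt;p&gt;That output-side math is just a rate table. A minimal sanity-check sketch, using the list prices quoted in this article; the model IDs are illustrative labels, not official API names:&lt;/p&gt;

```python
# Monthly output-token spend per flagship, at the April 2026 list
# prices quoted above (USD per million output tokens).
OUTPUT_PRICE_PER_M = {
    "gemini-3.1-pro": 12.00,
    "claude-opus-4-7": 25.00,
    "gpt-5.5": 30.00,
}

def monthly_output_cost(model, output_tokens_per_month):
    """Output-side API spend for a given monthly token volume."""
    millions = output_tokens_per_month / 1_000_000
    return OUTPUT_PRICE_PER_M[model] * millions

for model in OUTPUT_PRICE_PER_M:
    # At 10M output tokens/month: 120.0, 250.0, 300.0 respectively.
    print(model, monthly_output_cost(model, 10_000_000))
```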

&lt;p&gt;One important caveat on Opus 4.7: the new tokenizer can use roughly 1.0–1.35x more tokens than Opus 4.6 depending on content. Replay real prompts before assuming the list price is your actual cost.&lt;/p&gt;

&lt;p&gt;On GPT-5.5: cached input tokens drop to $0.50 per million — a tenth of the standard rate. Cache your system prompts and tool schemas on any multi-turn workflow.&lt;/p&gt;
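
&lt;p&gt;The caching discount compounds fast on multi-turn agents. A sketch of the blended input rate, using the $5.00 standard and $0.50 cached figures above; the cache hit ratio is whatever your workload actually achieves, not a given:&lt;/p&gt;

```python
# Blended per-million input rate for GPT-5.5 at a given cache hit ratio.
STANDARD = 5.00   # USD per million input tokens (standard)
CACHED = 0.50     # USD per million input tokens (cached)

def effective_input_rate(cache_hit_ratio):
    """Blended input rate for a cache hit ratio between 0.0 and 1.0."""
    return cache_hit_ratio * CACHED + (1.0 - cache_hit_ratio) * STANDARD

# An agent re-sending a large system prompt and tool schemas every turn
# can easily cache 80% of its input tokens:
print(round(effective_input_rate(0.8), 2))  # 1.4
```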




&lt;h2&gt;Why This Three-Way Split Is a Bigger Deal Than It Looks&lt;/h2&gt;

&lt;p&gt;The 2024 playbook was: pick the smartest model, use it for everything. That playbook is dead.&lt;/p&gt;

&lt;p&gt;The April 2026 frontier is differentiated enough that routing by task type is now the correct architecture. GPT-5.5 on terminal and browser tasks, Opus 4.7 on complex multi-file coding and tool orchestration, Gemini 3.1 Pro on research, video, and long-context analysis — that's not hedging, it's the optimal engineering decision given where benchmarks actually sit.&lt;/p&gt;
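
&lt;p&gt;In code, that routing policy is nothing exotic. A minimal sketch, assuming illustrative model IDs rather than official API names:&lt;/p&gt;

```python
# Task-type routing over the three flagships, following the split
# described above. A policy table, not a loyalty pledge.
ROUTES = {
    "terminal": "gpt-5.5",         # Terminal-Bench leader
    "browser": "gpt-5.5",          # BrowseComp leader
    "coding": "claude-opus-4-7",   # SWE-Bench Pro leader
    "tools": "claude-opus-4-7",    # MCP-Atlas leader
    "research": "gemini-3.1-pro",  # long context, cheapest input
    "video": "gemini-3.1-pro",     # only native video model
}

def route(task_type, default="claude-opus-4-7"):
    """Pick a model by task type; fall back to a sensible default."""
    return ROUTES.get(task_type, default)

print(route("terminal"))  # gpt-5.5
print(route("video"))     # gemini-3.1-pro
```

&lt;p&gt;In production you'd route on richer signals (context length, modality, budget ceilings), but the shape stays the same.&lt;/p&gt;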

&lt;p&gt;An IDC analyst framed the structural dynamic plainly: no single model wins everywhere, which is healthy for the ecosystem and gives developers real choices based on specific needs. The developers who treat model selection as a routing problem — rather than a loyalty problem — will ship better products at lower cost.&lt;/p&gt;




&lt;h2&gt;Access and Availability&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GPT-5.5&lt;/strong&gt; is live in ChatGPT for Plus, Pro, Business, and Enterprise users. API access (gpt-5.5) is available now through &lt;a href="https://platform.openai.com" rel="noopener noreferrer"&gt;OpenAI's platform&lt;/a&gt; at $5/$30 per million tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Claude Opus 4.7&lt;/strong&gt; (claude-opus-4-7) is generally available via the &lt;a href="https://anthropic.com/api" rel="noopener noreferrer"&gt;Anthropic API&lt;/a&gt;, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry at $5/$25 per million tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini 3.1 Pro&lt;/strong&gt; is available in developer preview via &lt;a href="https://ai.google.dev/" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;, Vertex AI, and Gemini CLI at $2/$12 per million tokens (under 200K context).&lt;/p&gt;




&lt;p&gt;There is no universal winner in April 2026. There are three strong models with distinct profiles, real price differences, and specific workloads where each one is the right default. The engineers who benchmark their actual tasks against all three will build better systems than the ones who follow lab marketing. Start there.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>claude</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Google Just Killed Vertex AI. Here's What the Gemini Enterprise Agent Platform</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Sat, 25 Apr 2026 03:31:36 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/google-just-killed-vertex-ai-heres-what-the-gemini-enterprise-agent-platform-4fh4</link>
      <guid>https://popcorn.forem.com/om_shree_0709/google-just-killed-vertex-ai-heres-what-the-gemini-enterprise-agent-platform-4fh4</guid>
      <description>&lt;p&gt;Vertex AI has been Google Cloud's AI development platform since 2021. On April 22, 2026, at Google Cloud Next in Las Vegas, Google retired it — not with a deprecation notice, but with a full rebrand and architectural overhaul. Going forward, all Vertex AI services and roadmap evolutions will be delivered exclusively through Agent Platform. If you're building on Google Cloud's AI stack, the ground just shifted.&lt;/p&gt;




&lt;h2&gt;The Problem It's Solving&lt;/h2&gt;

&lt;p&gt;Vertex AI was built for a different era. In the early days of generative AI, building safe and reliable business tools took massive engineering effort and a high tolerance for trial and error. Vertex handled that well — model selection, fine-tuning, deployment. But it was never designed for what enterprise AI has actually become: fleets of autonomous agents operating across dozens of systems simultaneously, often without proper security or governance guardrails.&lt;/p&gt;

&lt;p&gt;The gap is real. You can build a capable agent today without much trouble. Governing it — knowing what it's doing, what it has access to, whether it's behaving as intended — is a different problem entirely. Anthropic has Managed Agents, which cover runtime and memory but leave governance and observability to third parties. Google is betting that owning that full stack is the differentiator.&lt;/p&gt;




&lt;h2&gt;How the Gemini Enterprise Agent Platform Actually Works&lt;/h2&gt;

&lt;p&gt;The platform is organized around four pillars: Build, Scale, Govern, and Optimize. Each maps to a concrete set of tools, not just marketing categories.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build&lt;/strong&gt; covers the development surface. Agent Studio provides a low-code visual canvas for designing, prototyping, and managing agent reasoning loops. The Agent Development Kit (ADK) handles code-first development of complex agents. Agent Garden gives developers a library of prebuilt agents and templates. And Model Garden provides access to over 200 foundation models — including Gemini 3.1 Pro, Gemma 4, and third-party models like Anthropic's Claude Opus, Sonnet, and Haiku.&lt;/p&gt;

&lt;p&gt;A significant ADK upgrade ships with this release. More than six trillion tokens are processed monthly through ADK. The new graph-based framework lets you organize agents into a network of sub-agents, defining clear, reliable logic for how they collaborate on complex problems.&lt;/p&gt;
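
&lt;p&gt;The graph-of-sub-agents idea can be illustrated with nothing but the standard library. This is a toy sketch of the concept, not the ADK API; the agent names and the shared-context shape are assumptions for illustration:&lt;/p&gt;

```python
# Each sub-agent is a node; edges define who depends on whom; execution
# follows a topological order over the graph.
from graphlib import TopologicalSorter

def research(ctx): ctx["notes"] = "findings"
def draft(ctx):    ctx["draft"] = "draft from " + ctx["notes"]
def review(ctx):   ctx["final"] = ctx["draft"] + " (reviewed)"

# node: set of nodes it depends on
graph = {"research": set(), "draft": {"research"}, "review": {"draft"}}
agents = {"research": research, "draft": draft, "review": review}

def run(graph, agents):
    ctx = {}
    for name in TopologicalSorter(graph).static_order():
        agents[name](ctx)  # each sub-agent reads/writes shared context
    return ctx

print(run(graph, agents)["final"])  # draft from findings (reviewed)
```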

&lt;p&gt;&lt;strong&gt;Scale&lt;/strong&gt; is handled by Agent Runtime, which is rebuilt for a specific and important use case: long-running agents that maintain state for days at a time, backed by Memory Bank for persistent, long-term context. This is where Google draws a real line against stateless chat-based architectures. Payhawk is already using Memory Bank so their Financial Controller Agent recalls user habits and auto-submits expenses, cutting submission time by over 50%.&lt;/p&gt;
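
&lt;p&gt;The persistent-memory idea is worth making concrete. This is a hypothetical JSON-backed stand-in for illustration only, not the Memory Bank API:&lt;/p&gt;

```python
# Toy per-user agent memory that survives across sessions.
import json, os

class MemoryStore:
    def __init__(self, path):
        self.path = path
    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
    def remember(self, user, key, value):
        data = self._load()
        data.setdefault(user, {})[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)
    def recall(self, user):
        return self._load().get(user, {})

store = MemoryStore("agent_memory.json")
store.remember("alice", "expense_category_default", "travel")
# ...days later, a fresh session of the same agent still recalls it:
print(MemoryStore("agent_memory.json").recall("alice"))
```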

&lt;p&gt;&lt;strong&gt;Govern&lt;/strong&gt; is where this platform separates from everything else on the market. Three components do the work:&lt;/p&gt;

&lt;p&gt;Agent Identity gives every agent a unique cryptographic ID, creating a clear auditable trail for every action it takes, mapped back to defined authorization policies. Think of it as IAM, but for agents rather than humans — SPIFFE-formatted, natively integrated.&lt;/p&gt;
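
&lt;p&gt;Concretely, a SPIFFE-formatted identity is just a URI that everything downstream (audit logs, authorization policies) keys on. A sketch; the trust domain and path scheme here are assumptions, not Google's actual layout:&lt;/p&gt;

```python
# Build a SPIFFE-style agent ID and key audit records on it.
import uuid

def agent_spiffe_id(trust_domain, team, agent_name):
    """SPIFFE-formatted identity URI for an agent workload."""
    return f"spiffe://{trust_domain}/agents/{team}/{agent_name}"

def audit_record(agent_id, action):
    """An audit-trail entry mapped back to the agent's identity."""
    return {"id": str(uuid.uuid4()), "agent": agent_id, "action": action}

aid = agent_spiffe_id("example.corp", "finance", "expense-bot")
print(aid)  # spiffe://example.corp/agents/finance/expense-bot
print(audit_record(aid, "submit_expense"))
```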

&lt;p&gt;Agent Registry provides a single source of truth for the enterprise: it indexes every internal agent, tool, and skill, ensuring only governed and approved assets are available to your users.&lt;/p&gt;

&lt;p&gt;Agent Gateway acts as the air traffic control for your agent ecosystem — providing secure, unified connectivity between agents and tools across any environment, while enforcing consistent security policies and Model Armor protections to safeguard against prompt injection and data leakage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize&lt;/strong&gt; closes the loop with Agent Simulation, Agent Evaluation, and Agent Observability. Multi-Turn AutoRaters and Online Evaluation for live traffic give systematic quality assessment. The Unified Trace Viewer provides detailed visibility into agent reasoning and performance for debugging.&lt;/p&gt;




&lt;h2&gt;What Teams Are Actually Using It For&lt;/h2&gt;

&lt;p&gt;The customer quotes in the announcement are more concrete than typical launch testimonials, which makes them worth citing.&lt;/p&gt;

&lt;p&gt;Comcast rebuilt the Xfinity Assistant using ADK — moving from scripted automation to conversational, generative troubleshooting. Color Health built a Virtual Cancer Clinic that uses Agent Runtime to check screening eligibility, connect patients to clinicians, and schedule appointments at scale. L'Oréal is arguably the most technically interesting case: their Beauty Tech Agentic Platform uses ADK for agent orchestration, and connects agents to their data sources via Model Context Protocol (MCP), securely linked to their core operational applications.&lt;/p&gt;

&lt;p&gt;PayPal is also live with Agent Payment Protocol (AP2), using it as the foundation for trusted agent-initiated payments. That's not a demo — that's commerce infrastructure.&lt;/p&gt;

&lt;p&gt;One of GPT-5.5's big enterprise claims was that more than 85% of OpenAI's workforce uses Codex every week. Google's equivalent signal here is six trillion tokens per month through ADK alone. The scale is real.&lt;/p&gt;




&lt;h2&gt;Why This Is a Bigger Deal Than It Looks&lt;/h2&gt;

&lt;p&gt;The headline is governance. Every serious enterprise blocker for production agentic AI comes back to the same questions: Who authorized this agent to do that? What did it actually do? Can we audit it? Can we revoke it? Until this week, the honest answer in almost every platform was "partially, with custom tooling."&lt;/p&gt;

&lt;p&gt;An IDC analyst framed Google's actual differentiation clearly: "Google has entrenched hardware, developer tools to build and manage agents, and an end-user AI app in Gemini — no one else has those three. That full lifecycle is what they're really hoping differentiates them."&lt;/p&gt;

&lt;p&gt;The MCP integration is also worth flagging for this audience specifically. Agent Gateway and Agent Registry natively support MCP servers — meaning any tool you've already built using the &lt;a href="https://modelcontextprotocol.io" rel="noopener noreferrer"&gt;Model Context Protocol&lt;/a&gt; can be registered, governed, and exposed to agents through the same identity and policy system. That's a significant win for developers who've already built on MCP.&lt;/p&gt;

&lt;p&gt;Developers currently building on Vertex AI keep working in the same console, but the product has a different name and incorporates components that did not exist before: runtimes for long-running agents, persistent memory, registries with cryptographic IDs, security gateways, and simulation tools. The migration surface is small. The capability delta is not.&lt;/p&gt;




&lt;h2&gt;Availability and Access&lt;/h2&gt;

&lt;p&gt;Announced at Google Cloud Next on April 22, 2026, the platform brings together the Gemini Enterprise app, the Gemini Enterprise Agent Platform, and a partner marketplace that lets companies deploy third-party agents from vendors including Oracle, Salesforce, ServiceNow, Adobe, and Workday inside the same governed environment.&lt;/p&gt;

&lt;p&gt;You can access the platform directly at &lt;a href="https://console.cloud.google.com/agent-platform/overview" rel="noopener noreferrer"&gt;Agent Platform in the Google Cloud console&lt;/a&gt;. The ADK is available at &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/build/adk" rel="noopener noreferrer"&gt;docs.cloud.google.com&lt;/a&gt;. Full documentation for the governance layer — Agent Identity, Gateway, and Registry — is at &lt;a href="https://docs.cloud.google.com/gemini-enterprise-agent-platform/overview" rel="noopener noreferrer"&gt;the Agent Platform overview&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Google says the new Gemini Enterprise features will roll out over the coming months. Not everything is GA today — build your evaluation timeline accordingly.&lt;/p&gt;




&lt;p&gt;The enterprise agentic AI race has moved past "which model is smartest" into "which platform can actually govern thousands of agents at once." Google just made the most complete argument yet that it has an answer. Whether the execution matches the architecture is what the next six months will show.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>vertexai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>OpenAI Just Released GPT-5.5. Here's What It Actually Does (and What It Costs You)</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Sat, 25 Apr 2026 03:27:28 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/openai-just-released-gpt-55-heres-what-it-actually-does-and-what-it-costs-you-1i20</link>
      <guid>https://popcorn.forem.com/om_shree_0709/openai-just-released-gpt-55-heres-what-it-actually-does-and-what-it-costs-you-1i20</guid>
      <description>&lt;p&gt;GPT-5.4 shipped on March 5. Seven weeks later, on April 23, 2026, OpenAI released GPT-5.5 — and the pace alone tells you something about where this race is headed. This isn't iteration for iteration's sake. GPT-5.5 is a genuinely different model from the ground up, and if you're building on top of OpenAI's stack, the changes matter in ways that go beyond the benchmark table.&lt;/p&gt;

&lt;p&gt;Here's everything developers need to know.&lt;/p&gt;




&lt;h2&gt;The Problem It's Solving&lt;/h2&gt;

&lt;p&gt;The core complaint with every prior GPT-5.x model was the same: impressive on individual tasks, but brittle on anything that required sustained, multi-step reasoning. You'd hand it a complex task, get a decent first pass, and then spend the next hour managing every subsequent step yourself.&lt;/p&gt;

&lt;p&gt;GPT-5.5 is designed to handle messy, multi-part tasks where you can trust it to plan, use tools, check its own work, navigate ambiguity, and keep going without babysitting. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; That's the stated goal, and unlike most model launch claims, there's enough third-party benchmark data to take it seriously.&lt;/p&gt;




&lt;h2&gt;How GPT-5.5 Actually Works&lt;/h2&gt;

&lt;p&gt;The first thing to understand about GPT-5.5 is architectural. Every GPT model since GPT-5 — versions 5.1 through 5.4 — was built on the same base architecture. GPT-5.5 breaks that pattern entirely. It's a model trained from scratch. &lt;a href="https://lushbinary.com/blog/gpt-5-5-codex-autonomous-coding-agents-guide/" rel="noopener noreferrer"&gt;LushBinary&lt;/a&gt; That's not a minor detail. Fresh base training means the model reasons differently at a fundamental level, particularly in how it maintains context across long, multi-file, multi-step tasks.&lt;/p&gt;

&lt;p&gt;GPT-5.5 ships in two variants: the standard model (GPT-5.5 Thinking) and a higher-compute version called GPT-5.5 Pro. The model supports text and image input and has a context window of approximately 920K tokens. &lt;a href="https://artificialanalysis.ai/models/gpt-5-5" rel="noopener noreferrer"&gt;Artificial Analysis&lt;/a&gt; In Codex specifically, GPT-5.5 can be accessed with a 400,000-token context window across Plus, Pro, Business, Enterprise, Edu, and Go plans. &lt;a href="https://www.ghacks.net/2026/04/24/openai-releases-gpt-5-5-with-stronger-agentic-coding-computer-use-and-scientific-research-capabilities/" rel="noopener noreferrer"&gt;gHacks Tech News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPT-5.5 matches GPT-5.4's per-token latency in real-world serving while performing at a significantly higher level of intelligence. It also uses fewer tokens to complete the same Codex tasks. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; That last point matters for your cost model, which we'll get to.&lt;/p&gt;

&lt;p&gt;On the research side, OpenAI has a concrete example worth noting. An internal version of GPT-5.5 with a custom harness helped discover a new proof about Ramsey numbers in combinatorics, later verified in Lean — a concrete case of GPT-5.5 contributing not just code or explanation, but a mathematically novel argument in a core research area. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;What Developers Are Actually Using It For&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agentic coding in Codex&lt;/strong&gt; is the headline use case. The model is designed to handle engineering work such as implementation, refactoring, debugging, testing, and validation as a continuous loop. &lt;a href="https://www.developer-tech.com/news/openai-gpt-5-5-codex-coding-developer-workflows/" rel="noopener noreferrer"&gt;Developer Tech News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Real-world signals from early testers are notably specific. Dan Shipper, CEO of Every, said GPT-5.5 reproduced the type of system rewrite one of his engineers had eventually chosen for a post-launch issue, while GPT-5.4 could not. Pietro Schirano, CEO of MagicPath, said the model merged a branch with hundreds of frontend and refactor changes into a main codebase that had also diverged, resolving the work in about 20 minutes. Cursor co-founder Michael Truell noted GPT-5.5 stayed on task longer and showed more reliable tool use than GPT-5.4. &lt;a href="https://www.developer-tech.com/news/openai-gpt-5-5-codex-coding-developer-workflows/" rel="noopener noreferrer"&gt;Developer Tech News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computer use&lt;/strong&gt; is meaningfully better. On OSWorld-Verified, which assesses a model's ability to operate in real-world computer environments autonomously, GPT-5.5 achieves 78.7%, up from GPT-5.4's 75.0%. &lt;a href="https://www.ghacks.net/2026/04/24/openai-releases-gpt-5-5-with-stronger-agentic-coding-computer-use-and-scientific-research-capabilities/" rel="noopener noreferrer"&gt;gHacks Tech News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge work&lt;/strong&gt; across 44 occupations is tracked via GDPval. GPT-5.5 scores 84.9% on GDPval and, without prompt tuning, 98.0% on Tau2-bench Telecom, which tests complex customer-service workflows. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenAI also shared internal use cases: the Finance team used Codex to review 24,771 K-1 tax forms across 71,637 pages, helping accelerate the task by two weeks compared to the prior year. A Go-to-Market employee automated weekly business reporting, saving 5–10 hours per week. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Codex + browser expansion&lt;/strong&gt; is also new. With GPT-5.5, Codex can interact with web apps, test flows, click through pages, capture screenshots, and iterate on what it sees until it completes the task — expanding well beyond the terminal. &lt;a href="https://9to5mac.com/2026/04/23/openai-upgrades-chatgpt-and-codex-with-gpt-5-5-a-new-class-of-intelligence-for-real-work/" rel="noopener noreferrer"&gt;9to5Mac&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;What the Benchmarks Actually Show&lt;/h2&gt;

&lt;p&gt;OpenAI moved away from SWE-bench Verified as a primary eval, citing plateau concerns. The benchmarks now favored are more demanding and more representative of real work.&lt;/p&gt;

&lt;p&gt;On Terminal-Bench 2.0, GPT-5.5 achieves 82.7%, up from GPT-5.4's 75.1%. Claude Opus 4.7 sits at 69.4%. &lt;a href="https://www.ghacks.net/2026/04/24/openai-releases-gpt-5-5-with-stronger-agentic-coding-computer-use-and-scientific-research-capabilities/" rel="noopener noreferrer"&gt;gHacks Tech News&lt;/a&gt; Terminal-Bench tests real command-line workflows: multi-step shell scripting, package management, build configuration, container orchestration. A single wrong flag breaks the chain. This is the benchmark where GPT-5.5's lead is most decisive.&lt;/p&gt;

&lt;p&gt;On SWE-Bench Pro, GPT-5.5 scores 58.6%. Claude Opus 4.7 scores higher at 64.3%. &lt;a href="https://www.ghacks.net/2026/04/24/openai-releases-gpt-5-5-with-stronger-agentic-coding-computer-use-and-scientific-research-capabilities/" rel="noopener noreferrer"&gt;gHacks Tech News&lt;/a&gt; That's an honest trade-off OpenAI included in its own launch materials, a rare admission that signals confidence in the rest of the benchmark suite.&lt;/p&gt;

&lt;p&gt;On CyberGym, GPT-5.5 scores 81.8%, versus GPT-5.4's 79.0% and Claude Opus 4.7's 73.1%. &lt;a href="https://www.ghacks.net/2026/04/24/openai-releases-gpt-5-5-with-stronger-agentic-coding-computer-use-and-scientific-research-capabilities/" rel="noopener noreferrer"&gt;gHacks Tech News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On FrontierMath Tier 1–3, GPT-5.5 scores 51.7%, up from GPT-5.4's 47.6%. &lt;a href="https://skywork.ai/skypage/en/gpt-5-5-api-pricing-features/2047576515257520128" rel="noopener noreferrer"&gt;Skypage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One important caveat from third-party testing: in many benchmarks, GPT-5.4 Pro still outperforms the default GPT-5.5. &lt;a href="https://thenewstack.io/openai-launches-gpt-5-5-calling-it-a-new-class-of-intelligence/" rel="noopener noreferrer"&gt;The New Stack&lt;/a&gt; The Pro tier of the older model remains competitive unless you're specifically targeting the areas where the new architecture shines.&lt;/p&gt;




&lt;h2&gt;Why This Is a Bigger Deal Than It Looks&lt;/h2&gt;

&lt;p&gt;Two things make this release significant beyond the spec sheet.&lt;/p&gt;

&lt;p&gt;First, the architecture break. Every GPT-5.x model up to 5.4 was a refinement of the same base. GPT-5.5 is not. GPT-5.5 (codenamed "Spud") is the first fully retrained base model since GPT-4.5. &lt;a href="https://lushbinary.com/blog/gpt-5-5-codex-autonomous-coding-agents-guide/" rel="noopener noreferrer"&gt;LushBinary&lt;/a&gt; That changes what's possible downstream. The previous models delivered steady improvements to Codex, but each was constrained by the original GPT-5 architecture. GPT-5.5 doesn't have that ceiling.&lt;/p&gt;

&lt;p&gt;Second, the super app strategy. Greg Brockman said GPT-5.5 is another step toward a "super app" — a unified service combining ChatGPT, Codex, and an AI browser — that Brockman and Sam Altman envision as the primary interface for enterprise work. &lt;a href="https://techcrunch.com/2026/04/23/openai-chatgpt-gpt-5-5-ai-model-superapp/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt; GPT-5.5 is both a model release and an infrastructure move. The cadence — GPT-5.4 on March 5, GPT-5.5 on April 23 — is deliberate. OpenAI is trying to establish category lock-in before enterprise procurement cycles close.&lt;/p&gt;

&lt;p&gt;The NVIDIA integration is also notable. GPT-5.5 was co-designed, trained, and served on NVIDIA GB200 and GB300 NVL72 systems. Codex analyzed weeks of production traffic data and wrote custom heuristic algorithms for load balancing and partitioning, resulting in more than 20% faster token generation speeds. &lt;a href="https://www.developer-tech.com/news/openai-gpt-5-5-codex-coding-developer-workflows/" rel="noopener noreferrer"&gt;Developer Tech News&lt;/a&gt; The model helped optimize its own serving stack. That feedback loop between the model and the infrastructure it runs on is new.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Part That Should Actually Concern You: Pricing
&lt;/h2&gt;

&lt;p&gt;This is where the release gets complicated for independent developers and smaller teams.&lt;/p&gt;

&lt;p&gt;GPT-5.5 API pricing: $5.00 per million input tokens, $30.00 per million output tokens. &lt;a href="https://apidog.com/blog/gpt-5-5-pricing/" rel="noopener noreferrer"&gt;Apidog&lt;/a&gt; That's double GPT-5.4's input price of $2.50. GPT-5 launched in August 2025 at $0.63 per million input tokens. GPT-5.4 increased that to $2.50 in March 2026. GPT-5.5 doubles it again to $5.00 — nearly an 8x increase in under a year. &lt;a href="https://skywork.ai/skypage/en/gpt-5-5-api-pricing-features/2047576515257520128" rel="noopener noreferrer"&gt;Skypage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GPT-5.5 Pro pricing: $30 per million input tokens and $180 per million output tokens, with Priority processing at 2.5 times the standard rate. &lt;a href="https://www.edtechinnovationhub.com/news/openais-gpt-55-sets-new-benchmarks-in-agentic-coding-scientific-research-and-computer-use" rel="noopener noreferrer"&gt;EdTech Innovation Hub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenAI's defense of this is token efficiency — the model reaches the same output with fewer tokens, so your actual bill may not double even if the rate does. At 10 million output tokens per month, GPT-5.5 standard comes to $300 versus Claude Opus 4.7's $250. If GPT-5.5's agentic performance means 25% fewer task iterations, you break even. &lt;a href="https://www.buildfastwithai.com/blogs/gpt-5-5-review-2026" rel="noopener noreferrer"&gt;Build Fast with AI&lt;/a&gt; The math works — if the efficiency gains hold for your specific workload. Benchmark your actual tasks before assuming the sticker price reflects your real cost.&lt;/p&gt;
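&lt;p&gt;A quick sanity check on that break-even claim, using only the rates quoted above (the 25% efficiency scenario is the article's framing, not a measured result):&lt;/p&gt;

```python
# Break-even sketch: does GPT-5.5's higher output rate pay for itself
# if better agentic performance cuts total token volume?

def monthly_output_cost(tokens_millions, rate_per_million):
    return tokens_millions * rate_per_million

# Rates quoted above: GPT-5.5 $30/M output, Claude Opus 4.7 $25/M output.
gpt55 = monthly_output_cost(10, 30.00)     # $300 at 10M output tokens/month
opus47 = monthly_output_cost(10, 25.00)    # $250 at the same volume

# If fewer task iterations cut GPT-5.5's token volume by 25%:
gpt55_efficient = monthly_output_cost(10 * 0.75, 30.00)  # $225

print(gpt55, opus47, gpt55_efficient)
```

&lt;p&gt;At a 25% volume reduction the GPT-5.5 bill actually comes in below Opus 4.7's, which is why the claim hinges entirely on whether the efficiency gain materializes for your workload.&lt;/p&gt;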

&lt;p&gt;One concrete optimization to implement immediately: cached input tokens on GPT-5.5 drop to $0.50 per million — a tenth of the standard rate. Cache system prompts, tool schemas, and repo context on anything reused across requests. &lt;a href="https://skywork.ai/skypage/en/gpt-5-5-api-pricing-features/2047576515257520128" rel="noopener noreferrer"&gt;Skypage&lt;/a&gt;&lt;/p&gt;
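&lt;p&gt;The caching arithmetic is worth seeing concretely. A rough sketch, assuming a workload where most input tokens are reusable context (the 8M/2M split is illustrative):&lt;/p&gt;

```python
# Effect of cached input pricing on a repeated-context workload.
# Rates from above: $5.00/M standard input, $0.50/M cached input.

def input_cost(cached_millions, fresh_millions,
               cached_rate=0.50, fresh_rate=5.00):
    return cached_millions * cached_rate + fresh_millions * fresh_rate

# Example: 8M of 10M monthly input tokens are a reusable system prompt,
# tool schemas, and repo context; 2M are fresh per-request tokens.
uncached = input_cost(0, 10)   # $50.00 if nothing hits the cache
cached = input_cost(8, 2)      # $14.00 with the reusable 8M cached

print(uncached, cached)
```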




&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;GPT-5.5 is rolling out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro is rolling out to Pro, Business, and Enterprise users in ChatGPT. As of April 24, 2026, both GPT-5.5 and GPT-5.5 Pro are available in the API. &lt;a href="https://openai.com/index/introducing-gpt-5-5/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For API access, the model IDs are &lt;code&gt;gpt-5.5&lt;/code&gt; for standard and &lt;code&gt;gpt-5.5-pro&lt;/code&gt; for the Pro tier. Both are available through the Chat Completions and Responses APIs.&lt;/p&gt;
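&lt;p&gt;A minimal Chat Completions request body using those model IDs (the prompt content is illustrative):&lt;/p&gt;

```python
import json

# Minimal Chat Completions request body using the model IDs above.
payload = {
    "model": "gpt-5.5",  # or "gpt-5.5-pro" for the Pro tier
    "messages": [
        {"role": "system", "content": "You are a code-review assistant."},
        {"role": "user", "content": "Review this diff for regressions."},
    ],
}

print(json.dumps(payload, indent=2))
```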

&lt;p&gt;On the safety side, OpenAI has classified GPT-5.5's cybersecurity and biological capabilities as High under its Preparedness Framework, though below the Critical threshold. The company is also running a Trusted Access for Cyber program through Codex, allowing verified users expanded access to advanced security capabilities. &lt;a href="https://www.edtechinnovationhub.com/news/openais-gpt-55-sets-new-benchmarks-in-agentic-coding-scientific-research-and-computer-use" rel="noopener noreferrer"&gt;EdTech Innovation Hub&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Quick cost controls worth building in on day one: route premium, long-horizon tasks to GPT-5.5 and standard queries to GPT-5.4 or GPT-5.4-mini. The per-token price jump makes tiered routing a budget necessity, not an optimization.&lt;/p&gt;
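&lt;p&gt;A tiered router can be as simple as a few predicates in front of your API client. This is a sketch with illustrative task flags, not any official SDK feature:&lt;/p&gt;

```python
# Tiered routing sketch: send premium, long-horizon work to gpt-5.5 and
# everything else to cheaper tiers. Task flags here are illustrative.

def pick_model(task):
    if task.get("long_horizon") or task.get("premium"):
        return "gpt-5.5"       # $5.00/M input
    if task.get("simple"):
        return "gpt-5.4-mini"  # cheapest tier for routine queries
    return "gpt-5.4"           # $2.50/M input default

print(pick_model({"long_horizon": True}))  # agentic refactor -> gpt-5.5
print(pick_model({"simple": True}))        # FAQ lookup -> gpt-5.4-mini
print(pick_model({}))                      # everything else -> gpt-5.4
```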




&lt;p&gt;The real story here isn't a single model release — it's the six-week cadence that produced it. OpenAI is shipping at a pace that forces enterprise decisions before anyone has time to fully evaluate. Whether that serves developers or just locks them in faster is a question the next few months will answer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>openai</category>
      <category>discuss</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>DeepSeek Just Dropped V4. Here's What the Benchmarks Actually Tell You.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Fri, 24 Apr 2026 09:01:03 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/deepseek-just-dropped-v4-heres-what-the-benchmarks-actually-tell-you-1oae</link>
      <guid>https://popcorn.forem.com/om_shree_0709/deepseek-just-dropped-v4-heres-what-the-benchmarks-actually-tell-you-1oae</guid>
      <description>&lt;p&gt;Open-source AI has spent two years being "almost there." With DeepSeek-V4-Pro, the gap with frontier closed-source models isn't almost closed — in some benchmarks, it's gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem It's Solving
&lt;/h2&gt;

&lt;p&gt;The standard narrative has been simple: closed-source models from OpenAI, Google, and Anthropic sit at the frontier. Open-source models follow, months behind, at a fraction of the cost but with a meaningful capability tax. You pay in quality for what you save in dollars.&lt;/p&gt;

&lt;p&gt;DeepSeek-V4-Pro-Max — the maximum reasoning effort mode of DeepSeek-V4-Pro — is being positioned as the best open-source model available today, significantly advancing knowledge capabilities and bridging the gap with leading closed-source models on reasoning and agentic tasks. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; That's a bold claim. The benchmark data makes it harder to dismiss than the usual open-source PR.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Works
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4-Pro ships as a 1.6 trillion parameter Mixture-of-Experts model with 49 billion parameters activated per token, while DeepSeek-V4-Flash runs at 284 billion total with 13 billion activated. Both support a one million token context window. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture is doing real work here, not just scaling. A hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) dramatically improves long-context efficiency — in the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; That's not a marginal improvement. That's a fundamentally different inference cost profile at scale.&lt;/p&gt;

&lt;p&gt;Manifold-Constrained Hyper-Connections (mHC) strengthen residual connections across layers while preserving model expressivity &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;, and the Muon optimizer handles training stability. This isn't DeepSeek iterating on V3 — it's a ground-up architectural rethink.&lt;/p&gt;

&lt;p&gt;The reasoning modes matter for how you deploy. Both Pro and Flash support three effort levels: standard, high, and max. For Think Max reasoning mode, DeepSeek recommends setting the context window to at least 384K tokens. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt; The Flash-Max mode is particularly interesting — it achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale places it slightly behind on pure knowledge tasks and the most complex agentic workflows. &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash" rel="noopener noreferrer"&gt;Hugging Face&lt;/a&gt;&lt;/p&gt;
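&lt;p&gt;One practical guard worth encoding: don't request Think Max with a context window below DeepSeek's recommended 384K floor. A minimal sketch; the field names are illustrative, not DeepSeek's actual API schema:&lt;/p&gt;

```python
# Sketch: enforce DeepSeek's recommendation that Think Max mode gets a
# context window of at least 384K tokens. Field names are illustrative.

EFFORT_MIN_CONTEXT = {"standard": 0, "high": 0, "max": 384_000}

def validate_config(effort, context_window_tokens):
    floor = EFFORT_MIN_CONTEXT[effort]
    if context_window_tokens >= floor:
        return {"reasoning_effort": effort,
                "context_window": context_window_tokens}
    raise ValueError(
        f"effort={effort!r} needs a context window of at least {floor} tokens"
    )

cfg = validate_config("max", 384_000)  # passes at the recommended floor
print(cfg)
```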

&lt;h2&gt;
  
  
  What Developers Are Actually Using It For
&lt;/h2&gt;

&lt;p&gt;The benchmark table that &lt;a href="https://glama.ai" rel="noopener noreferrer"&gt;Frank Fiegel at Glama&lt;/a&gt; flagged this morning tells the real story — specifically, the agentic and coding numbers.&lt;/p&gt;

&lt;p&gt;On LiveCodeBench, V4-Pro leads the pack at 93.5, ahead of Gemini (91.7) and Claude (88.8). Codeforces rating — a real-world competitive programming measure — puts V4-Pro at 3206, ahead of GPT-5.4 (3168) and Gemini (3052). &lt;a href="https://officechai.com/ai/deepseek-v4-pro-deepseek-v4-flash-benchmarks-pricing/" rel="noopener noreferrer"&gt;OfficeChai&lt;/a&gt; Competitive programming benchmarks are notoriously hard to game; this is the kind of number that makes engineers pay attention.&lt;/p&gt;

&lt;p&gt;On SWE-Verified (real software engineering tasks), V4-Pro sits at 80.6 — within a fraction of Claude (80.8) and matching Gemini (80.6). On Terminal Bench 2.0, V4-Pro (67.9) beats Claude (65.4) and is competitive with Gemini (68.5), though GPT-5.4 leads at 75.1. &lt;a href="https://officechai.com/ai/deepseek-v4-pro-deepseek-v4-flash-benchmarks-pricing/" rel="noopener noreferrer"&gt;OfficeChai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For math reasoning: on IMOAnswerBench, V4-Pro scores 89.8 — well ahead of Claude (75.3) and Gemini (81.0), though GPT-5.4 edges ahead at 91.4. &lt;a href="https://officechai.com/ai/deepseek-v4-pro-deepseek-v4-flash-benchmarks-pricing/" rel="noopener noreferrer"&gt;OfficeChai&lt;/a&gt; The one clear gap is Humanity's Last Exam, where V4-Pro scores 37.7 — just below GPT-5.4 (39.8), Claude (40.0), and Gemini (44.4). &lt;a href="https://officechai.com/ai/deepseek-v4-pro-deepseek-v4-flash-benchmarks-pricing/" rel="noopener noreferrer"&gt;OfficeChai&lt;/a&gt; Factual world knowledge retrieval is still where closed-source models hold a real edge.&lt;/p&gt;

&lt;p&gt;DeepSeek says V4 has been optimized for use with popular agent tools including Claude Code and OpenClaw &lt;a href="https://www.cnbc.com/2026/04/24/deepseek-v4-llm-preview-open-source-ai-competition-china.html" rel="noopener noreferrer"&gt;CNBC&lt;/a&gt;, which signals the team is building for production agentic deployment, not just benchmark positioning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;The capability story is interesting. The cost story is the one that matters for anyone running production workloads.&lt;/p&gt;

&lt;p&gt;For context: OpenAI's GPT-5.4 costs $2.50 per 1M input tokens and $15.00 per 1M output tokens, while Claude Opus 4.6 costs $5 per 1M input tokens and $25 per 1M output tokens. DeepSeek — at least on benchmarks — delivers similar performance to these models at a 50-80% cost reduction. &lt;a href="https://officechai.com/ai/deepseek-v4-pro-deepseek-v4-flash-benchmarks-pricing/" rel="noopener noreferrer"&gt;OfficeChai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The timing is not accidental. OpenAI shipped GPT-5.5 the same day, and DeepSeek launched into that window anyway, betting that an open-source 1M-context MoE at a fraction of the cost would cut through rather than be buried under a closed-source announcement. &lt;a href="https://ofox.ai/blog/deepseek-v4-release-guide-2026/" rel="noopener noreferrer"&gt;Ofox&lt;/a&gt; Shipping head-to-head with your biggest competitor's release is a calculated move.&lt;/p&gt;

&lt;p&gt;The V3.2 to V4-Pro jump on Arena AI's live code leaderboard is 88 Elo — roughly the delta separating the third- and thirteenth-ranked models on the current board. It is a genuine generational step, not a refresh. &lt;a href="https://ofox.ai/blog/deepseek-v4-release-guide-2026/" rel="noopener noreferrer"&gt;Ofox&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The MCPAtlas Public benchmark in the LinkedIn post — where V4-Pro-Max scores 73.6 against Opus 4.6's 73.8 — is the number that stands out most for anyone building MCP-integrated agent pipelines. Open-source is now essentially at parity on structured tool use. That's the gap that just closed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;The weights are hosted on Hugging Face and ModelScope in FP8 and FP4+FP8 mixed precision formats, released under the MIT License for research and commercial use. &lt;a href="https://www.androidsage.com/2026/04/24/deepseek-v4-released-a-better-ai-alternative-to-chatgpt/" rel="noopener noreferrer"&gt;Android Sage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;DeepSeek's pricing sits at $0.14/million tokens input and $0.28/million tokens output for Flash, and $1.74/million input and $3.48/million output for Pro. &lt;a href="https://simonwillison.net/2026/Apr/24/deepseek-v4/" rel="noopener noreferrer"&gt;Simon Willison&lt;/a&gt; The API is live today via OpenRouter and DeepSeek's own endpoint, supporting both OpenAI ChatCompletions and Anthropic protocols.&lt;/p&gt;
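&lt;p&gt;Putting the quoted rates side by side makes the gap concrete. This is pure arithmetic on the prices above and the GPT-5.4/Opus rates cited earlier:&lt;/p&gt;

```python
# Cost per "round" of 1M input + 1M output tokens, at the quoted rates.
rates = {                      # (input $/M, output $/M)
    "deepseek-v4-flash": (0.14, 0.28),
    "deepseek-v4-pro":   (1.74, 3.48),
    "gpt-5.4":           (2.50, 15.00),
    "claude-opus-4.6":   (5.00, 25.00),
}

def round_cost(model, in_millions=1.0, out_millions=1.0):
    in_rate, out_rate = rates[model]
    return in_millions * in_rate + out_millions * out_rate

for model in rates:
    print(f"{model:18s} ${round_cost(model):6.2f} per 1M in + 1M out")
```

&lt;p&gt;At these rates V4-Pro runs roughly 17% of Opus 4.6's cost per round, and Flash sits an order of magnitude below that — which is the cost story the benchmarks make hard to ignore.&lt;/p&gt;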

&lt;p&gt;Running a 1.6T parameter model locally requires significant GPU infrastructure — even in FP4+FP8 mixed precision, the memory requirements are substantial. &lt;a href="https://www.androidsage.com/2026/04/24/deepseek-v4-released-a-better-ai-alternative-to-chatgpt/" rel="noopener noreferrer"&gt;Android Sage&lt;/a&gt; For most teams, the API is the practical path. Flash-Max gives you near-Pro reasoning at Flash pricing, which is the configuration worth benchmarking against your specific workloads first.&lt;/p&gt;




&lt;p&gt;The gap between open-source and frontier AI just got measurably smaller — and for the first time, in some categories that actually matter for production agentic systems, it's not a gap at all. The question for teams running closed-source models at frontier prices is no longer "when will open-source catch up?" It's "what are we still paying for?"&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deepseek</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Microsoft Fabric Just Exposed Its MCP Architecture. Here's What It Actually Changes for Data Teams.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Fri, 24 Apr 2026 01:17:59 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/microsoft-fabric-just-exposed-its-mcp-architecture-heres-what-it-actually-changes-for-data-teams-1i4e</link>
      <guid>https://popcorn.forem.com/om_shree_0709/microsoft-fabric-just-exposed-its-mcp-architecture-heres-what-it-actually-changes-for-data-teams-1i4e</guid>
      <description>&lt;p&gt;Enterprise data platforms have spent decades building walls around their data. Microsoft just shipped the protocol that lets AI agents walk through those walls — natively, securely, and without a single custom integration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem It's Solving
&lt;/h2&gt;

&lt;p&gt;Every time an engineering team wants to connect an AI agent to a data platform, they rebuild the same plumbing from scratch: OAuth2 flows, token management, rate-limiting logic, API versioning, error handling. That's before the agent even does anything useful. Multiply that across a company running &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://claude.ai" rel="noopener noreferrer"&gt;Claude&lt;/a&gt;, &lt;a href="https://cursor.sh" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, and &lt;a href="https://www.microsoft.com/en-us/microsoft-copilot/microsoft-copilot-studio" rel="noopener noreferrer"&gt;Copilot Studio&lt;/a&gt; simultaneously, and the integration surface becomes unmanageable.&lt;/p&gt;

&lt;p&gt;The deeper issue is that AI tools have no shared language for talking to enterprise systems. Each integration is bespoke, brittle, and built by someone who had better things to do. The agent either gets too little context or too much — and neither produces reliable outputs against production infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Fabric MCP Architecture Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/microsoft-fabric" rel="noopener noreferrer"&gt;Microsoft Fabric&lt;/a&gt; is now shipping two distinct MCP entry points, each targeting a different level of autonomy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Fabric Local MCP&lt;/a&gt;&lt;/strong&gt; is now Generally Available. It's an open-source server that runs on the developer's machine, giving AI assistants deep knowledge of Fabric's APIs. It also enables local-to-cloud data operations — upload data to OneLake, create items, inspect table schemas — all within a single conversation. The Local MCP can wrap the Fabric CLI as tools, meaning CI/CD pipelines can use it to deploy changes with no human in the loop. Authentication is integrated, so there's no manual token management. The recommended install path is a VS Code extension that configures everything automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Fabric Remote MCP&lt;/a&gt;&lt;/strong&gt; is in Preview. This is the cloud-hosted server — no local setup required. It lets AI agents perform authenticated operations directly in a Fabric environment: managing workspaces, handling permissions, executing tasks on behalf of teams. This is the entry point for autonomous agents running in Copilot Studio, not developers pair-programming at a terminal.&lt;/p&gt;

&lt;p&gt;Both run inside the security model, audit trail, and RBAC boundaries Fabric already enforces. The agents can only access what the authenticated user can access. There are no additional roles to provision, no shadow permissions, no new attack surface to manage.&lt;/p&gt;

&lt;p&gt;The underlying protocol making this possible is &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; — originally created by Anthropic and now adopted by GitHub, Cloudflare, Stripe, and a growing list of enterprise platforms. Rather than creating unique integrations for each AI tool, exposing the platform as an MCP server means any MCP-compatible client can connect instantly. &lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Microsoft Fabric&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Teams Are Actually Using It For
&lt;/h2&gt;

&lt;p&gt;The use cases split cleanly by role.&lt;/p&gt;

&lt;p&gt;A developer building a data pipeline uses the Local MCP to let GitHub Copilot or Claude look up the correct Fabric API spec, generate code against it, upload data to &lt;a href="https://learn.microsoft.com/en-us/fabric/onelake/onelake-overview" rel="noopener noreferrer"&gt;OneLake&lt;/a&gt;, and validate the result — all within one conversation thread. The agent isn't guessing at APIs or hallucinating parameter names. It's reading the live spec through the MCP server.&lt;/p&gt;

&lt;p&gt;A data team running autonomous workflows points &lt;a href="https://www.microsoft.com/en-us/microsoft-copilot/microsoft-copilot-studio" rel="noopener noreferrer"&gt;Copilot Studio&lt;/a&gt; at the Remote MCP. The agent provisions workspaces, adjusts permissions, and manages resources on behalf of the team without anyone opening the Fabric portal.&lt;/p&gt;

&lt;p&gt;A CI/CD pipeline uses the Fabric CLI wrapped as MCP tools to deploy changes on a schedule, no human in the loop, no interactive auth required.&lt;/p&gt;

&lt;p&gt;And separately, &lt;a href="https://blog.fabric.microsoft.com/en-us/blog/give-your-ai-agent-the-keys-to-onelake-onelake-mcp-generally-available/" rel="noopener noreferrer"&gt;OneLake MCP is now Generally Available&lt;/a&gt; as part of the same extension, letting agents traverse the full OneLake hierarchy — from workspace to item to table schema to physical Delta Lake files — through natural language. An admin could ask an agent to inventory every item in a workspace, a data engineer could check table optimization across lakehouses, and an analyst could explore an unfamiliar dataset without writing a query. &lt;a href="https://blog.fabric.microsoft.com/en-us/blog/give-your-ai-agent-the-keys-to-onelake-onelake-mcp-generally-available/" rel="noopener noreferrer"&gt;Microsoft Fabric&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;When Microsoft previewed the Fabric Local MCP in October, the announcement became one of their most-read posts, approaching 100K views. &lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Microsoft Fabric&lt;/a&gt; That's not a vanity metric — it's a signal that data engineers are actively looking for exactly this kind of native agent integration, not another middleware layer to manage.&lt;/p&gt;

&lt;p&gt;The more consequential signal is architectural. Microsoft didn't build a Fabric-specific agent framework. They implemented MCP — the same protocol Anthropic, GitHub, Cloudflare, and Stripe are converging on — and exposed Fabric through it. That's a deliberate bet that the agentic ecosystem will standardize on one protocol, and that being MCP-native is table stakes for enterprise platforms going forward.&lt;/p&gt;

&lt;p&gt;The analogy Microsoft uses in their own post is precise: MCP is to AI what USB was to hardware — a universal connector that replaces a tangle of proprietary cables with a single standard. &lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Microsoft Fabric&lt;/a&gt; USB didn't make hardware more capable. It made capability composable. That's exactly what MCP does for data infrastructure.&lt;/p&gt;

&lt;p&gt;For teams evaluating where to build agentic data workflows, this changes the calculus. Fabric is no longer just a lakehouse or a BI platform. It's now a surface that any MCP-compatible agent can operate against, with enterprise-grade governance baked in, not bolted on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Fabric Local MCP&lt;/a&gt; is Generally Available now. Install via the VS Code Marketplace extension — it configures automatically and works with GitHub Copilot, Cursor, Claude Desktop, and any MCP-compatible client. &lt;a href="https://blog.fabric.microsoft.com/en-US/blog/agentic-fabric-how-mcp-is-turning-your-data-platform-into-an-ai-native-operating-system/" rel="noopener noreferrer"&gt;Fabric Remote MCP&lt;/a&gt; is in Preview. &lt;a href="https://blog.fabric.microsoft.com/en-us/blog/give-your-ai-agent-the-keys-to-onelake-onelake-mcp-generally-available/" rel="noopener noreferrer"&gt;OneLake MCP tools&lt;/a&gt; ship automatically as part of the Fabric MCP extension if you already have it installed — no additional configuration required.&lt;/p&gt;




&lt;p&gt;The question enterprise data teams should be asking isn't whether to adopt MCP-native tooling. It's how quickly they can deprecate the custom integration layers they've already built. That migration just got a lot easier.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>microsoft</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Google Just Launched an Official Agent Skills Repository. Here's What It Actually Solves.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Fri, 24 Apr 2026 01:15:08 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/google-just-launched-an-official-agent-skills-repository-heres-what-it-actually-solves-2k5c</link>
      <guid>https://popcorn.forem.com/om_shree_0709/google-just-launched-an-official-agent-skills-repository-heres-what-it-actually-solves-2k5c</guid>
      <description>&lt;p&gt;Google just shipped an official repository of Agent Skills at &lt;a href="https://www.googlecloudevents.com/next-vegas/" rel="noopener noreferrer"&gt;Google Cloud Next 2026&lt;/a&gt;. It's a quiet announcement, but it points at one of the most persistent unsolved problems in production agentic AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem It's Solving
&lt;/h2&gt;

&lt;p&gt;MCP servers were supposed to fix context. Give your agent a live, grounded connection to documentation, and it wouldn't hallucinate outdated APIs or confuse one SDK version with another. And that largely works — Google already runs an &lt;a href="https://developers.google.com/knowledge/mcp" rel="noopener noreferrer"&gt;MCP server for its developer docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But there's a compounding cost. When agents lean heavily on MCP servers, they pull massive amounts of context into their window on every request. The model drowns in raw documentation, token costs spike, and coherence drops. The community calls this "context bloat," and it gets worse the more products an agent is expected to know.&lt;/p&gt;

&lt;p&gt;The real gap isn't access to information. It's the absence of &lt;em&gt;condensed&lt;/em&gt;, agent-optimized expertise that loads on demand rather than all at once.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Agent Skills Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://agentskills.io/home" rel="noopener noreferrer"&gt;Agent Skills&lt;/a&gt; is an open format — originally developed by Anthropic and released as a community standard — for giving agents packaged, structured expertise. At its core, a skill is a folder containing a &lt;code&gt;SKILL.md&lt;/code&gt; file with metadata and task-specific instructions. It can also bundle scripts, reference docs, templates, and other assets. Think of it as agent-first documentation: compact, purposeful, and written for a machine that needs to act, not just read.&lt;/p&gt;

&lt;p&gt;The mechanism that makes it practical is progressive disclosure. At startup, an agent loads only the name and description of each available skill — just enough to know whether a skill is relevant to the current task. When there's a match, the full instructions are pulled into context. The agent then executes, optionally running bundled scripts or referencing additional files.&lt;/p&gt;

&lt;p&gt;Full context only loads when it's actually needed. That's the design decision that separates this from dumping a documentation site into a system prompt.&lt;/p&gt;
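&lt;p&gt;The progressive-disclosure mechanism described above can be sketched in a few lines. The skill data here is illustrative; real skills live in &lt;code&gt;SKILL.md&lt;/code&gt; files with bundled assets:&lt;/p&gt;

```python
# Progressive disclosure sketch: only names and descriptions enter the
# context window at startup; full instructions load on a task match.

SKILLS = {
    "bigquery": {
        "description": "Write and optimize BigQuery SQL and jobs.",
        "instructions": "(full agent-facing BigQuery guidance...)",
    },
    "cloud-run": {
        "description": "Deploy and configure Cloud Run services.",
        "instructions": "(full agent-facing Cloud Run guidance...)",
    },
}

def startup_index():
    # Cheap: a one-line summary per skill, loaded for every request.
    return {name: skill["description"] for name, skill in SKILLS.items()}

def load_skill(name):
    # Expensive: the full instruction body, pulled in only when relevant.
    return SKILLS[name]["instructions"]

index = startup_index()
print(index)
print(load_skill("bigquery"))
```

&lt;p&gt;The design point is that the per-skill cost at startup stays constant and tiny no matter how large each skill's full instruction body grows.&lt;/p&gt;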

&lt;h2&gt;
  
  
  What Developers Are Actually Using It For
&lt;/h2&gt;

&lt;p&gt;Google's official repository launches at &lt;a href="https://github.com/google/skills" rel="noopener noreferrer"&gt;github.com/google/skills&lt;/a&gt; with thirteen skills out of the gate. Seven are product-specific: AlloyDB, BigQuery, Cloud Run, Cloud SQL, Firebase, the Gemini API, and GKE. Three map to the &lt;a href="https://docs.cloud.google.com/architecture/framework" rel="noopener noreferrer"&gt;Well-Architected Framework&lt;/a&gt; pillars — Security, Reliability, and Cost Optimization. And three are "recipe" skills covering onboarding, authentication, and network observability.&lt;/p&gt;

&lt;p&gt;Installing them is a single command: &lt;code&gt;npx skills install github.com/google/skills&lt;/code&gt;. They work across &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;Antigravity&lt;/a&gt;, &lt;a href="https://geminicli.com/" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, and any third-party agent that implements the Skills spec.&lt;/p&gt;

&lt;p&gt;The product-specific skills are the immediately practical ones. An agent working against BigQuery or GKE no longer needs to maintain a live MCP connection to documentation just to get accurate syntax, service limits, or recommended patterns. The skill carries that knowledge, loads it when relevant, and stays out of the way otherwise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;The Agent Skills format wasn't built by Google — it was built by Anthropic and open-sourced. Google adopting it as the vehicle for their official documentation layer is a meaningful signal: this is becoming infrastructure, not a framework-specific feature.&lt;/p&gt;

&lt;p&gt;For teams building agents on Google Cloud, the practical implication is real. You can now equip an agent with accurate, maintained, Google-authored knowledge about Cloud Run or Firebase without inflating every prompt with raw documentation. The skills are versioned, auditable, and composable — which matters when you're running multi-step workflows across multiple GCP products.&lt;/p&gt;

&lt;p&gt;The deeper shift here is architectural. MCP solved &lt;em&gt;access&lt;/em&gt;. Agent Skills solves &lt;em&gt;delivery&lt;/em&gt;. They're complementary, and the combination starts to look like a serious answer to the context problem that's been quietly breaking production agents for the past year.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;The repository is live now at &lt;a href="https://github.com/google/skills" rel="noopener noreferrer"&gt;github.com/google/skills&lt;/a&gt;. Google has confirmed additional skills will ship in the coming weeks and months. The &lt;a href="https://agentskills.io/specification" rel="noopener noreferrer"&gt;Agent Skills format spec&lt;/a&gt; is open, meaning any agent platform can implement support, and any team can build and distribute their own skills using the same structure.&lt;/p&gt;




&lt;p&gt;Context bloat has been treated like an engineering nuisance. Google just made the case that it's an infrastructure problem — and shipped a solution for it. The question now is how quickly the rest of the ecosystem follows with their own official skills repositories.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>googlecloud</category>
      <category>discuss</category>
      <category>agentskills</category>
    </item>
    <item>
      <title>The Bitwarden CLI Just Got Backdoored. Here's What the Supply Chain Attack Actually Did.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Fri, 24 Apr 2026 01:11:43 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/the-bitwarden-cli-just-got-backdoored-heres-what-the-supply-chain-attack-actually-did-4aoi</link>
      <guid>https://popcorn.forem.com/om_shree_0709/the-bitwarden-cli-just-got-backdoored-heres-what-the-supply-chain-attack-actually-did-4aoi</guid>
      <description>&lt;p&gt;Bitwarden serves over 10 million users and 50,000 businesses. On April 22, 2026, for exactly 93 minutes, its CLI was shipping malware.&lt;/p&gt;

&lt;p&gt;This was not a phishing campaign. Nobody tricked a Bitwarden employee into clicking a link. The attackers walked straight through the CI/CD pipeline, injected malicious code into an official release, and let Bitwarden's own npm publishing mechanism do the distribution for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem It's Solving (For the Attackers)
&lt;/h2&gt;

&lt;p&gt;Supply chain attacks are effective precisely because they subvert trust. You don't need to compromise a developer's machine directly if you can compromise the tool they're already running with elevated permissions in their build pipeline.&lt;/p&gt;

&lt;p&gt;The affected package version was &lt;code&gt;@bitwarden/cli@2026.4.0&lt;/code&gt;, and the malicious code was published in &lt;code&gt;bw1.js&lt;/code&gt;, a file included in the package contents (&lt;a href="https://thehackernews.com/2026/04/bitwarden-cli-compromised-in-ongoing.html" rel="noopener noreferrer"&gt;The Hacker News&lt;/a&gt;). The Bitwarden CLI sits in a privileged position in developer environments — it's commonly used for secrets injection and automated deployments in CI/CD pipelines. That makes it a high-value target.&lt;/p&gt;

&lt;p&gt;The compromise was connected to the ongoing Checkmarx supply chain campaign, with a threat group hijacking the npm package and injecting malicious code designed to steal sensitive data from developer workstations and CLI environments (&lt;a href="https://securityboulevard.com/2026/04/bitwarden-cli-compromise-linked-to-ongoing-checkmarx-supply-chain-campaign/" rel="noopener noreferrer"&gt;Security Boulevard&lt;/a&gt;). The researchers who caught it — Socket, JFrog, Ox Security, and StepSecurity — identified it as part of a broader pattern that has been running since at least March 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Attack Actually Worked
&lt;/h2&gt;

&lt;p&gt;The attackers gained access by exploiting a GitHub Actions workflow in Bitwarden's CI/CD pipeline, mirroring previously documented techniques in the Checkmarx campaign, where threat actors leveraged stolen credentials to inject malicious workflows, exfiltrate secrets, and tamper with build outputs before distribution. &lt;a href="https://cyberinsider.com/bitwarden-cli-backdoored-in-checkmarx-supply-chain-attack/" rel="noopener noreferrer"&gt;CyberInsider&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once inside, the attackers quietly embedded their payload in a legitimate build. The malicious code, shipped in a file called &lt;code&gt;bw1.js&lt;/code&gt;, ran during package installation and harvested GitHub and npm tokens, SSH keys, environment variables, shell history, and cloud credentials (&lt;a href="https://tech.yahoo.com/cybersecurity/articles/bitwarden-cli-supply-chain-attack-142710104.html" rel="noopener noreferrer"&gt;Yahoo!&lt;/a&gt;).&lt;/p&gt;
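&lt;p&gt;The install-time execution is worth dwelling on: npm automatically runs lifecycle scripts declared in a package's manifest during &lt;code&gt;npm install&lt;/code&gt;, which is exactly the hook an install-time payload needs. Below is a minimal audit sketch — the function name and directory walk are mine, not from any advisory — that flags installed packages declaring install-time hooks:&lt;/p&gt;

```python
import json
from pathlib import Path

# npm runs these manifest hooks automatically during `npm install`,
# which is how an install-time payload gets executed.
INSTALL_HOOKS = {"preinstall", "install", "postinstall"}

def packages_with_install_hooks(node_modules):
    """Return (package name, declared hooks) for every installed package
    whose package.json declares an install-time lifecycle script."""
    flagged = []
    # Top-level packages, plus scoped ones living under @scope/name.
    manifests = list(Path(node_modules).glob("*/package.json"))
    manifests += Path(node_modules).glob("@*/*/package.json")
    for manifest in manifests:
        try:
            pkg = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable manifest: skip rather than crash the audit
        hooks = sorted(INSTALL_HOOKS.intersection(pkg.get("scripts", {})))
        if hooks:
            flagged.append((pkg.get("name", manifest.parent.name), hooks))
    return flagged
```

&lt;p&gt;A flagged hook isn't proof of compromise — plenty of legitimate packages compile native code on install — but it shrinks the haystack to the packages that &lt;em&gt;could&lt;/em&gt; run code on your machine at install time.&lt;/p&gt;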

&lt;p&gt;The exfiltration destination is worth noting. The stolen data was encrypted with AES-256-GCM and exfiltrated to &lt;code&gt;audit[.]checkmarx[.]cx&lt;/code&gt;, a domain impersonating Checkmarx (&lt;a href="https://thehackernews.com/2026/04/bitwarden-cli-compromised-in-ongoing.html" rel="noopener noreferrer"&gt;The Hacker News&lt;/a&gt;). Typosquatting a security company's domain to hide malware traffic is a particularly cynical touch.&lt;/p&gt;

&lt;p&gt;The blast radius doesn't stop at the developer's machine either. If GitHub tokens are found, the malware weaponizes them to inject malicious Actions workflows into repositories and extract CI/CD secrets — meaning a single developer with the affected version installed can become the entry point for a broader supply chain compromise, with the attacker gaining persistent workflow injection access to every CI/CD pipeline the developer's token can reach. &lt;a href="https://thehackernews.com/2026/04/bitwarden-cli-compromised-in-ongoing.html" rel="noopener noreferrer"&gt;The Hacker News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's also an attribution wrinkle that security researchers are still untangling. The shared tooling strongly suggests a connection to the same malware ecosystem as the Checkmarx campaign, but the operational signatures differ: ideological branding is embedded directly in the malware, from the Shai-Hulud repository names to the "Butlerian Jihad" manifesto payload to commit messages proclaiming resistance against machines (&lt;a href="https://socket.dev/blog/bitwarden-cli-compromised" rel="noopener noreferrer"&gt;Socket&lt;/a&gt;). That points to either a splinter group or a campaign evolution — not a clean attribution to TeamPCP, who claimed the original Checkmarx attack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Developers Are Actually Exposed To
&lt;/h2&gt;

&lt;p&gt;Only the npm CLI package was affected. Bitwarden's Chrome extension, MCP server, and other official distribution channels remain uncompromised. &lt;a href="https://cybersecuritynews.com/bitwarden-cli-compromised/" rel="noopener noreferrer"&gt;Cyber Security News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The malicious package was active between 5:57 PM and 7:30 PM ET on April 22, 2026 (&lt;a href="https://community.bitwarden.com/t/bitwarden-statement-on-checkmarx-supply-chain-incident/96127" rel="noopener noreferrer"&gt;Bitwarden&lt;/a&gt;). That's a 93-minute exposure window. Narrow, but enough.&lt;/p&gt;

&lt;p&gt;Security researcher Adnan Khan noted this is the first known compromise of a package using npm's trusted publishing mechanism, which was designed to eliminate long-lived tokens (&lt;a href="https://beincrypto.com/bitwarden-cli-supply-chain-attack-crypto/" rel="noopener noreferrer"&gt;BeInCrypto&lt;/a&gt;). That's significant. Trusted publishing was supposed to be the hardened path. The attackers didn't bypass it — they compromised the GitHub Actions workflow upstream of it, then let the trusted mechanism publish for them.&lt;/p&gt;

&lt;p&gt;TeamPCP's broader campaign separately targets crypto wallet data, including MetaMask, Phantom, and Solana wallet files, and has chained similar attacks against Trivy, Checkmarx, and LiteLLM since March 2026, targeting developer tools that sit deep in build pipelines. &lt;a href="https://tech.yahoo.com/cybersecurity/articles/bitwarden-cli-supply-chain-attack-142710104.html" rel="noopener noreferrer"&gt;Yahoo!&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;Bitwarden confirmed the incident but contained the framing carefully: no end-user vault data was accessed, no production systems compromised. That's true, and it matters. But the more important story here isn't about Bitwarden specifically.&lt;/p&gt;

&lt;p&gt;The attack is another example of the increasing cybersecurity risks to CI/CD architectures as they become more foundational in the software development pipeline, with threat actors expanding their targeting of them in supply chain campaigns. &lt;a href="https://securityboulevard.com/2026/04/bitwarden-cli-compromise-linked-to-ongoing-checkmarx-supply-chain-campaign/" rel="noopener noreferrer"&gt;Security Boulevard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The vector — GitHub Actions compromise leading to poisoned npm releases — is repeatable. It has been used against Trivy, Checkmarx's own tooling, and LiteLLM. The Bitwarden compromise isn't an isolated incident; it's the latest iteration of a campaign that is actively refining its technique against high-trust developer tooling.&lt;/p&gt;

&lt;p&gt;And the MCP angle is worth flagging: the malicious &lt;code&gt;bw1.js&lt;/code&gt; payload shares core infrastructure with the previously analyzed &lt;code&gt;mcpAddon.js&lt;/code&gt;, including an identical C2 endpoint (&lt;a href="https://cybersecuritynews.com/bitwarden-cli-compromised/" rel="noopener noreferrer"&gt;Cyber Security News&lt;/a&gt;). As MCP servers proliferate across developer toolchains, they're becoming targets in the same supply chain vector. The attack surface is expanding in lockstep with adoption.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and Access (What To Do Right Now)
&lt;/h2&gt;

&lt;p&gt;If you installed &lt;code&gt;@bitwarden/cli@2026.4.0&lt;/code&gt; during the window: treat your environment as fully compromised. Rotate every secret the machine had access to — GitHub tokens, npm credentials, cloud provider keys, SSH keys, everything.&lt;/p&gt;

&lt;p&gt;Socket recommends downgrading to version 2026.3.0 or switching to official signed binaries from Bitwarden's website. &lt;a href="https://beincrypto.com/bitwarden-cli-supply-chain-attack-crypto/" rel="noopener noreferrer"&gt;BeInCrypto&lt;/a&gt;&lt;/p&gt;
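&lt;p&gt;Before downgrading, confirm whether the compromised release is actually in your tree. A quick sketch against an npm v2/v3 &lt;code&gt;package-lock.json&lt;/code&gt; — the version string comes from the advisories above; the helper itself is illustrative:&lt;/p&gt;

```python
import json
from pathlib import Path

# Known-bad releases named in the advisories.
COMPROMISED = {"@bitwarden/cli": {"2026.4.0"}}

def find_compromised(lockfile_path):
    """Scan an npm v2/v3 package-lock.json and return (name, version)
    pairs that match a known-compromised release."""
    lock = json.loads(Path(lockfile_path).read_text(encoding="utf-8"))
    hits = []
    # Keys in the "packages" map look like "node_modules/@bitwarden/cli";
    # the empty key is the root project itself, which we skip.
    for path, meta in lock.get("packages", {}).items():
        name = path.rpartition("node_modules/")[2]
        version = meta.get("version")
        if name and version and version in COMPROMISED.get(name, ()):
            hits.append((name, version))
    return hits
```

&lt;p&gt;A hit means the machine ran the install-time payload and every credential it could reach should be rotated. A clean scan of one lockfile, though, says nothing about global installs via &lt;code&gt;npm install -g&lt;/code&gt;, which won't appear in a project lockfile.&lt;/p&gt;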

&lt;p&gt;On endpoints and runners, hunt for outbound connections to &lt;code&gt;audit[.]checkmarx[.]cx&lt;/code&gt;, execution of Bun where it is not normally used, and access to files such as &lt;code&gt;.npmrc&lt;/code&gt;, &lt;code&gt;.git-credentials&lt;/code&gt;, &lt;code&gt;.env&lt;/code&gt;, and cloud credential stores. For GitHub Actions, review whether any unapproved workflows were created on transient branches. &lt;a href="https://socket.dev/blog/bitwarden-cli-compromised" rel="noopener noreferrer"&gt;Socket&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A CVE for Bitwarden CLI version 2026.4.0 is being issued in connection with this incident. &lt;a href="https://cyberinsider.com/bitwarden-cli-backdoored-in-checkmarx-supply-chain-attack/" rel="noopener noreferrer"&gt;CyberInsider&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;The Bitwarden CLI attack is a clean demonstration of where the real risk in developer infrastructure lives right now: not in the applications themselves, but in the build systems that ship them. One poisoned GitHub Actions workflow, one 93-minute publish window, and a trusted tool becomes a credential harvester running inside your pipeline with your own permissions.&lt;/p&gt;

&lt;p&gt;The question isn't whether your password manager's vault is safe. It's whether you can verify the integrity of every tool in your build chain. Most teams can't.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>discuss</category>
      <category>devops</category>
    </item>
    <item>
      <title>Google Just Split Its TPU Into Two Chips. Here's What That Actually Signals About the Agentic Era.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Thu, 23 Apr 2026 03:47:11 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/google-just-split-its-tpu-into-two-chips-heres-what-that-actually-signals-about-the-agentic-era-2485</link>
      <guid>https://popcorn.forem.com/om_shree_0709/google-just-split-its-tpu-into-two-chips-heres-what-that-actually-signals-about-the-agentic-era-2485</guid>
      <description>&lt;p&gt;Training and inference have always had different physics. Google just decided to stop pretending one chip could handle both.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://cloud.google.com/blog/products/compute/ai-infrastructure-at-next26" rel="noopener noreferrer"&gt;Google Cloud Next '26&lt;/a&gt; on April 22, Google announced the eighth generation of its Tensor Processing Units — but for the first time in TPU history, that generation isn't a single chip. It's two: the &lt;strong&gt;TPU 8t&lt;/strong&gt; for training, and the &lt;strong&gt;TPU 8i&lt;/strong&gt; for inference and agentic workloads. That architectural split is the most meaningful signal in this announcement, and most coverage has buried it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem It's Solving
&lt;/h2&gt;

&lt;p&gt;Standard RAG retrieves. Agents reason, plan, execute, and loop back. That distinction matters enormously at the infrastructure level.&lt;/p&gt;

&lt;p&gt;Chat-based AI inference has a relatively forgiving latency budget. A user submits a prompt, waits a second or two, reads the response. Agentic workflows don't work that way. A primary agent decomposes a goal into subtasks, dispatches specialized agents, collects results, evaluates them, and decides what to do next — all in real time, potentially across thousands of concurrent sessions. The per-step latency compounds. If your inference chip is optimized for throughput over latency (which it was, because that's what training needs), you end up with agent loops that are sluggish, expensive, and hard to scale.&lt;/p&gt;
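&lt;p&gt;The compounding is easy to see with round numbers — these figures are illustrative, not Google's:&lt;/p&gt;

```python
def loop_latency_s(steps, per_step_s):
    """Wall-clock time of a sequential agent loop: per-step latency
    adds up across the loop, it never amortizes."""
    return steps * per_step_s

# A chat turn pays its latency once; an agent that plans, calls tools,
# and re-evaluates 25 times pays it 25 times per user request.
chat_turn = loop_latency_s(1, 2.0)     # 2.0 s, feels fine
agent_run = loop_latency_s(25, 0.4)    # 10.0 s for the "same" request
faster_chip = loop_latency_s(25, 0.2)  # 5.0 s: halving step latency halves the loop
```

&lt;p&gt;That last line is the whole argument for latency-optimized inference silicon: in an agent loop, every millisecond shaved per step is multiplied by the loop depth, and again by the number of concurrent sessions.&lt;/p&gt;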

&lt;p&gt;Previous TPU generations, including last year's &lt;a href="https://cloud.google.com/tpu" rel="noopener noreferrer"&gt;Ironwood&lt;/a&gt;, were pitched as unified flagship chips. Google's internal experience running Gemini, its consumer AI products, and increasingly complex agent workloads apparently showed that a single architecture forces uncomfortable trade-offs. So they split the roadmap.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the TPU 8t and TPU 8i Actually Work
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;TPU 8t&lt;/strong&gt; is the training powerhouse. It packs 9,600 chips in a single superpod to provide 121 exaflops of compute and two petabytes of shared memory connected through high-speed inter-chip interconnects. That's roughly 3x higher compute performance than the previous generation, with doubled ICI bandwidth to ensure that massive models hit near-linear scaling. At the cluster level, Google can now connect more than one million TPUs across multiple data center sites into a training cluster — essentially transforming globally distributed infrastructure into one seamless supercomputer.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;TPU 8i&lt;/strong&gt; is the more architecturally interesting chip. With 3x more on-chip SRAM over the previous generation, TPU 8i can host a larger KV Cache entirely on silicon, significantly reducing the idle time of the cores during long-context decoding. The key innovation is a component called the &lt;strong&gt;Collectives Acceleration Engine (CAE)&lt;/strong&gt; — a dedicated unit that aggregates results across cores with near-zero latency, specifically accelerating the reduction and synchronization steps required during autoregressive decoding and chain-of-thought processing. The result: on-chip latency of collectives drops by 5x.&lt;/p&gt;
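&lt;p&gt;To see why hosting KV cache on-chip is such a fight, it helps to size one. The formula below is the standard transformer KV-cache calculation; the model shape plugged in is hypothetical, chosen only for round numbers:&lt;/p&gt;

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache for one sequence: two tensors (K and V) per layer,
    each of shape [kv_heads, seq_len, head_dim]."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical large model: 80 layers, 8 grouped-query KV heads,
# head_dim 128, fp16 values, 128K-token context.
one_seq = kv_cache_bytes(80, 8, 128, 131072)
print(one_seq / 2**30)  # 40.0 GiB for a single long-context sequence
```

&lt;p&gt;Under these assumptions a single long-context sequence already dwarfs any on-chip memory; smaller models and shorter contexts are where a cache can sit entirely in SRAM, and that is where the idle-core savings during decoding come from.&lt;/p&gt;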

&lt;p&gt;Google also redesigned the inter-chip network topology specifically for 8i. The previous 3D torus topology prioritized bandwidth. For 8i, Google changed how chips connect, using fully connected boards aggregated into groups — a high-radix design called Boardfly that links up to 1,152 chips, reducing the network diameter and the number of hops a data packet must take to cross the system, achieving up to a 50% improvement in latency for communication-intensive workloads.&lt;/p&gt;
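&lt;p&gt;The diameter claim is easy to sanity-check with a toy model. This is generic topology arithmetic, not Boardfly's actual wiring, which Google hasn't fully specified:&lt;/p&gt;

```python
def torus_3d_diameter(k):
    """Worst-case hop count in a k x k x k 3D torus with wraparound
    links: at most k // 2 hops along each of the three axes."""
    return 3 * (k // 2)

def two_level_diameter():
    """Fully connected boards, boards fully connected within a group:
    worst case is a chip-to-board link, one board-to-board link, then
    board-to-chip -- a constant 3 hops regardless of scale."""
    return 3

print(torus_3d_diameter(10))  # 15 hops worst case for ~1,000 chips
print(two_level_diameter())   # 3 hops, flat as the pod grows
```

&lt;p&gt;Collectives cross the network diameter repeatedly, so cutting worst-case hops from fifteen to a handful is where an "up to 50%" latency figure can plausibly come from.&lt;/p&gt;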

&lt;p&gt;In raw spec terms, the 8i delivers 9.8x the FP8 EFlops per pod, 6.8x the HBM capacity per pod, and a pod size that grows 4.5x from 256 to 1,152 chips compared to the prior generation.&lt;/p&gt;

&lt;p&gt;The economic headline: TPU 8i delivers 80% better performance per dollar for inference than the prior generation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Teams Are Actually Using This For
&lt;/h2&gt;

&lt;p&gt;The split architecture is most directly useful for three categories of workload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontier model training&lt;/strong&gt; at labs and large enterprises. TPU 8t was designed in partnership with Google DeepMind and is built to efficiently train world models like DeepMind's Genie 3, enabling millions of agents to practice and refine their reasoning in diverse simulated environments. If you're training large proprietary models, the 8t's near-linear scaling at million-chip clusters changes the economics of when you can afford to retrain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-concurrency agentic inference&lt;/strong&gt; is where the 8i shines. Multi-agent pipelines, MoE model serving, chain-of-thought reasoning loops — all of these hammer the all-to-all communication patterns that the Boardfly topology specifically addresses. The implication is lower latency per agent step at scale, which compounds significantly when you're running thousands of parallel agent sessions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement learning post-training&lt;/strong&gt; sits between the two. Google's new Axion-powered N4A CPU instances handle the complex logic, tool calls, and feedback loops surrounding the core AI model — offering up to 30% better price-performance for comparable agent workloads than other hyperscalers. The intended stack is TPU 8t for pre-training, TPU 8i for RL and inference, and Axion for orchestration logic.&lt;/p&gt;

&lt;p&gt;Google is also wrapping all of this in upgraded networking. The Virgo Network's collapsed fabric architecture offers 4x the bandwidth of previous generations and can connect 134,000 TPUs into a single fabric in a single data center. Storage got overhauled too: Google Cloud Managed Lustre now delivers 10 TB/s of bandwidth — a 10x improvement over last year — with sub-millisecond latency via TPUDirect and RDMA, allowing data to bypass the host and move directly to the accelerators.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;The obvious read on this announcement is "Google vs. Nvidia." That framing is mostly wrong, and Google itself isn't pretending otherwise. Google promises its cloud will have Nvidia's latest chip, Vera Rubin, available later this year, and the two companies are co-engineering the open-source Falcon networking protocol via the Open Compute Project. This is not a replacement strategy — it's a portfolio strategy.&lt;/p&gt;

&lt;p&gt;The more important signal is what the architectural split says about where the AI workload is going. Seven generations of TPUs were built on the assumption that training and inference are different phases of the same pipeline — you train, then you serve. The 8t/8i split encodes a different belief: that agentic inference is so architecturally distinct from training that they require fundamentally different silicon. That's a bet on the permanence of agentic workflows, not just a current optimization.&lt;/p&gt;

&lt;p&gt;For enterprise buyers, the 8t/8i split reframes the 2026–2027 cloud evaluation in concrete ways: teams training large proprietary models should look at 8t availability windows and Virgo networking access. Teams serving agents or reasoning workloads should evaluate 8i on Vertex AI and whether HBM-per-pod sizing fits their context windows.&lt;/p&gt;

&lt;p&gt;There's also a vertical integration argument here that's easy to underestimate. Google co-designs its chips with DeepMind, runs them on its own networking fabric, manages its own storage layer, and orchestrates everything through GKE. Native PyTorch support for TPU — TorchTPU — is now in preview with select customers, allowing models to run on TPUs as-is with full support for native PyTorch Eager Mode. That removes one of the biggest friction points developers have historically had with TPUs: you no longer need to rewrite your training code to access Google's silicon. Combined with vLLM support on TPU, the migration path from an Nvidia-based setup is shorter than it's ever been.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;TPU 8t and TPU 8i will be available to Cloud customers later in 2026. You can request more information now to prepare for their general availability. The chips are integrated into Google's &lt;a href="https://cloud.google.com/solutions/ai-hypercomputer" rel="noopener noreferrer"&gt;AI Hypercomputer&lt;/a&gt; stack, supporting JAX, PyTorch, vLLM, and XLA. Deployment options range from &lt;a href="https://cloud.google.com/vertex-ai" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt; managed services to &lt;a href="https://cloud.google.com/kubernetes-engine" rel="noopener noreferrer"&gt;GKE&lt;/a&gt; for teams that want infrastructure-level control.&lt;/p&gt;

&lt;p&gt;The honest caveat: these are self-reported benchmarks against Google's own prior generation. Independent third-party numbers from cloud customers and evaluators will emerge over the next two quarters, and those will be the numbers that actually matter for procurement decisions.&lt;/p&gt;

&lt;p&gt;The split TPU roadmap isn't just a chip announcement — it's Google encoding its architectural thesis about what AI infrastructure looks like in an agentic world directly into silicon. Every other hyperscaler is going to have to answer the same question: do you build one chip to do everything, or do you specialize?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>discuss</category>
      <category>cloud</category>
      <category>google</category>
    </item>
    <item>
      <title>NeoCognition Just Raised $40M to Fix the One Thing Every AI Agent Gets Wrong</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Thu, 23 Apr 2026 03:34:27 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/neocognition-just-raised-40m-to-fix-the-one-thing-every-ai-agent-gets-wrong-i1n</link>
      <guid>https://popcorn.forem.com/om_shree_0709/neocognition-just-raised-40m-to-fix-the-one-thing-every-ai-agent-gets-wrong-i1n</guid>
      <description>&lt;p&gt;Every AI agent demo looks impressive until you actually depend on one. That 50% task completion rate you've quietly accepted as "normal"? &lt;a href="https://techcrunch.com/2026/04/21/ai-research-lab-neocognition-lands-40m-seed-to-build-agents-that-learn-like-humans/" rel="noopener noreferrer"&gt;NeoCognition&lt;/a&gt; just called it out directly, and raised $40 million to do something about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem It's Solving
&lt;/h2&gt;

&lt;p&gt;The foundational critique that NeoCognition is building on is blunt: current agents — whether Claude Code, OpenClaw, or Perplexity's computer tools — successfully complete tasks as intended only about 50% of the time. That is not a UX problem or a prompt engineering problem. It's a structural one. Today's agents are stateless generalists. They bring no accumulated knowledge of your environment, your workflows, or your domain's specific constraints to each task. Every time you invoke one, it's starting from scratch.&lt;/p&gt;

&lt;p&gt;The standard industry response to this has been fine-tuning — custom-engineering an agent for a specific vertical and hoping it holds. That works until the domain shifts, the tooling changes, or you need to deploy the same agent somewhere new. Then you're back to zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  How NeoCognition Actually Works
&lt;/h2&gt;

&lt;p&gt;NeoCognition was started by Yu Su, Xiang Deng, and Yu Gu, who all worked together in Su's AI agent lab at Ohio State University. Su's team began developing LLM-based agents before the ChatGPT moment, and their research — including Mind2Web and MMMU — is now used by OpenAI, Anthropic, and Google. This is not a product team that pivoted into agents. It's the research behind the agents you're already using, now building something opinionated about what those agents got wrong.&lt;/p&gt;

&lt;p&gt;The core thesis is drawn from how humans actually acquire expertise. NeoCognition's agents continuously learn the structure, workflows, and constraints of the environments they operate in, and specialize into domain experts by learning a world model of work. The phrase "world model" is doing significant work here. Rather than applying general reasoning to every task, these agents are designed to build an internal map of a specific micro-environment — its rules, its dependencies, its edge cases — and continuously refine that map through experience.&lt;/p&gt;

&lt;p&gt;The Palo Alto startup argues that its agents learn on the job as specialists rather than relying on fixed general training, which is the architectural distinction that matters. Fixed training is a snapshot. A world model grows.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Enterprises Are Actually Using It For
&lt;/h2&gt;

&lt;p&gt;NeoCognition's primary target is the enterprise market, specifically the SaaS layer. The company intends to sell its agent systems mainly to enterprises, including established SaaS companies, which can use them to build agent workers or to enhance existing product offerings. The framing here is interesting: they're not just selling agents to enterprises, they're selling the infrastructure for SaaS companies to make their own products agentic.&lt;/p&gt;

&lt;p&gt;The Vista Equity Partners participation is strategic, not just financial. As one of the largest private equity firms in the software space, Vista can provide NeoCognition with direct access to a vast portfolio of companies looking to modernize their products with AI. That's a go-to-market lever, not just a check. You don't close Vista for the cap table optics — you close them because they own the distribution you need.&lt;/p&gt;

&lt;p&gt;The deeper implication for enterprises is the safety argument. Deeper understanding of their environments enables NeoCognition's agents to be more responsible and safer actors in high-stakes settings. An agent that understands why a workflow exists — not just what the workflow is — is less likely to take a technically correct action that's contextually wrong. That's the difference between a tool and a trusted system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;The investor list deserves more attention than most coverage is giving it. Angel investors and founding advisors include Lip-Bu Tan, CEO of Intel, Ion Stoica, co-founder and executive chairman of Databricks, and leading AI researchers like Dawn Song, Ruslan Salakhutdinov, and Luke Zettlemoyer. That last trio — Song, Salakhutdinov, Zettlemoyer — are foundational researchers in modern deep learning and NLP. When researchers of that caliber put their names on a company, they're endorsing the technical thesis, not just the team.&lt;/p&gt;

&lt;p&gt;The timing reflects a broader pattern in AI investment in 2026: capital is increasingly flowing not towards frontier model development — dominated by a small number of well-capitalized labs — but towards the infrastructure and agent layer above it. The model wars are effectively over for now. The next real competition is in what those models can reliably &lt;em&gt;do&lt;/em&gt;, and that's an infrastructure and learning problem, not a parameter-count problem.&lt;/p&gt;

&lt;p&gt;What NeoCognition is proposing — agents that build structured world models of their operating environments — is also the missing architectural primitive for MCP-based agent pipelines. Right now, most agentic systems using MCP are still stateless: each tool call happens in context, but the agent isn't &lt;em&gt;learning&lt;/em&gt; the tool ecosystem it operates in. An agent layer that builds persistent, structured knowledge of its environment and the tools available to it would meaningfully change what's achievable in production agentic workflows.&lt;/p&gt;
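&lt;p&gt;To make the statelessness point concrete, here's a toy sketch of the missing primitive — persistent, structured knowledge about tools that survives across sessions. The class and schema are mine, not NeoCognition's:&lt;/p&gt;

```python
import json
from pathlib import Path

class ToolMemory:
    """Toy world-model fragment: persist what an agent learned about
    each tool so the next session starts warm instead of cold."""

    def __init__(self, store_path):
        self.path = Path(store_path)
        if self.path.exists():
            self.facts = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.facts = {}

    def record(self, tool, lesson):
        """Save a lesson learned about a tool, e.g. a rate limit or an
        argument quirk discovered during a failed call."""
        lessons = self.facts.setdefault(tool, [])
        if lesson not in lessons:
            lessons.append(lesson)
        self.path.write_text(json.dumps(self.facts))

    def brief(self, tool):
        """What a fresh session should know before first calling the tool."""
        return list(self.facts.get(tool, []))
```

&lt;p&gt;A production world model would capture structure — dependencies, constraints, workflow semantics — rather than a flat list of strings, but even this much changes the loop: session two doesn't repeat session one's mistakes.&lt;/p&gt;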

&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;NeoCognition has just emerged from stealth, so there's no public product available yet. The company currently has about 15 employees, the majority of whom hold PhDs. This is explicitly still a research-to-product transition — the $40M is funding that transition. Enterprise access will likely come through direct partnership channels, given the Vista relationship and the SaaS-first go-to-market. Developers wanting to follow the research can track Su's prior work through his &lt;a href="https://ysu1989.github.io/" rel="noopener noreferrer"&gt;Ohio State lab page&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;The 50% reliability ceiling on current agents isn't a model problem — it's a memory and specialization problem. NeoCognition is making a structural bet that the next unlock in agent reliability isn't more parameters; it's agents that actually learn where they're deployed. If they're right, the companies building on today's stateless agent architectures are building on borrowed time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>agents</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Google's Project Jitro Just Redefined What a Coding Agent Is. Here's What It Actually Changes.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Wed, 22 Apr 2026 03:35:56 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/googles-project-jitro-just-redefined-what-a-coding-agent-is-heres-what-it-actually-changes-4oc3</link>
      <guid>https://popcorn.forem.com/om_shree_0709/googles-project-jitro-just-redefined-what-a-coding-agent-is-heres-what-it-actually-changes-4oc3</guid>
      <description>&lt;p&gt;Project Jules used to tell your AI what to do. Jitro tells it what you want. That gap — between task execution and outcome ownership — is the entire bet Google is making with its next-generation coding agent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem With Every Coding Agent Right Now
&lt;/h2&gt;

&lt;p&gt;Every major AI coding tool today — &lt;a href="https://github.com/features/copilot" rel="noopener noreferrer"&gt;GitHub Copilot&lt;/a&gt;, &lt;a href="https://cursor.sh/" rel="noopener noreferrer"&gt;Cursor&lt;/a&gt;, &lt;a href="https://codeium.com/windsurf" rel="noopener noreferrer"&gt;Windsurf&lt;/a&gt;, &lt;a href="https://openai.com/codex" rel="noopener noreferrer"&gt;OpenAI's Codex&lt;/a&gt; — operates on the same underlying model: you define the work, the agent does it. You write the prompt, you review the output, you write the next prompt. The developer is still the scheduler, the project manager, and the QA team. The AI is a very fast, very capable executor.&lt;/p&gt;

&lt;p&gt;That's genuinely useful. But it hits a ceiling. When your goal is "reduce memory leaks in the backend by 20%" or "get our accessibility score to 100%," you don't want to translate that into ten sequential prompts across a week. You want to hand it off. No current tool actually lets you do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Project Jitro Actually Works
&lt;/h2&gt;

&lt;p&gt;Google is internally developing Project Jitro as an autonomous AI system that moves beyond prompt-based coding to independently execute high-level development goals. It's built on &lt;a href="https://jules.google/" rel="noopener noreferrer"&gt;Jules&lt;/a&gt;, Google's existing asynchronous coding agent — but the architecture is meaningfully different.&lt;/p&gt;

&lt;p&gt;Rather than asking developers to manually instruct an agent on what to build or fix, Jules V2 appears designed around high-level goal-setting — KPI-driven development, where the agent autonomously identifies what needs to change in a codebase to move a metric in the right direction.&lt;/p&gt;

&lt;p&gt;The workspace model is the critical piece. A dedicated workspace for the agent suggests Google envisions Jitro as a persistent collaborator rather than a one-shot tool. Early signals point to a workspace where developers can list goals, track insights, and configure tool integrations — a layer of continuity that current coding agents don't offer.&lt;/p&gt;

&lt;p&gt;From leaked tooling definitions, the Jitro workspace API exposes operations like: list goals, create a goal after helping articulate it clearly, list insights, get update history for an insight, and list configured tool integrations including MCP remote servers and API connections. That last item is significant — Jitro integrates through Model Context Protocol (MCP) remote servers and various API connections to ensure it has the context it needs.&lt;/p&gt;
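&lt;p&gt;Nothing about Jitro's API is public beyond those leaked operation names, so treat the following as a purely hypothetical sketch of a client surface mirroring them — every name and type here is invented:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Goal:
    title: str    # e.g. "Reduce backend memory leaks"
    metric: str   # the KPI the agent steers, e.g. "accessibility score"
    target: str   # e.g. "at least 100"

@dataclass
class Workspace:
    """Hypothetical mirror of the leaked operations: goals, insights
    with update history, and configured tool integrations."""
    goals: list = field(default_factory=list)
    insights: dict = field(default_factory=dict)      # insight id -> updates
    integrations: list = field(default_factory=list)  # MCP servers, APIs

    def create_goal(self, title, metric, target):
        goal = Goal(title, metric, target)
        self.goals.append(goal)
        return goal

    def list_goals(self):
        return list(self.goals)

    def list_insights(self):
        return list(self.insights)

    def insight_history(self, insight_id):
        return list(self.insights.get(insight_id, []))

    def list_integrations(self):
        return list(self.integrations)
```

&lt;p&gt;The interesting part isn't the shapes — it's that the unit of state is a goal plus its insight history, not a conversation. That's the continuity layer current coding agents lack.&lt;/p&gt;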

&lt;p&gt;Transparency is baked in by design. When you set a goal in the Jitro workspace, the AI doesn't just operate silently — it surfaces its reasoning process, explaining why it chose a specific library or restructured a database table. You stay in control by approving the general direction, while the AI handles the execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Engineering Teams Are Actually Going to Use This For
&lt;/h2&gt;

&lt;p&gt;The use cases where this goal-driven paradigm genuinely wins are the ones that are currently painful in proportion to their importance: reducing error rates becomes the objective instead of debugging individual functions; improving test coverage becomes the target instead of writing test cases manually across multiple files; increasing conversions becomes the priority instead of adjusting isolated page elements without strategy alignment.&lt;/p&gt;

&lt;p&gt;The primary beneficiaries would be engineering teams managing large codebases where incremental improvements compound — performance optimization, test coverage, accessibility compliance.&lt;/p&gt;

&lt;p&gt;Jules V1 already demonstrated that the asynchronous model works. During the beta, thousands of developers tackled tens of thousands of tasks, resulting in over 140,000 code improvements shared publicly. Jules is now out of beta and available across free and paid tiers, integrated into Google AI Pro and Ultra subscriptions. Jitro inherits that async foundation and extends it to goals that span sessions, not just tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;The shift from prompt-driven to goal-driven AI isn't a UX improvement — it's a change in the unit of work. Right now, developer productivity is measured by how good your prompts are. Jitro changes that to how clearly you can define outcomes.&lt;/p&gt;

&lt;p&gt;Routine tasks like debugging, writing boilerplate code, or running tests may increasingly be handled by AI systems. As a result, developers may shift toward higher-level responsibilities — guiding AI systems, reviewing outputs, and aligning technical work with business goals.&lt;/p&gt;

&lt;p&gt;This marks a departure from the task-level paradigm seen across competitors like GitHub Copilot, Cursor, and even OpenAI's Codex agent, all of which still rely on developers defining specific work items. If Jitro ships as described, it resets what the category baseline looks like. Every competitor will be asked why their tool still needs a prompt for every action.&lt;/p&gt;

&lt;p&gt;The MCP integration angle is also worth watching closely. A goal-oriented coding agent that natively connects to MCP remote servers can reach across your entire toolchain — CI/CD, monitoring, issue trackers — rather than reasoning only over local files. That's a different class of tool.&lt;/p&gt;
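&lt;p&gt;For readers unfamiliar with the shape of that integration: MCP clients are typically pointed at servers through a named-server configuration map. The server names and URLs below are invented examples (no Jitro config has been published), but they show how a goal-driven agent could be wired to remote toolchain context.&lt;/p&gt;

```python
# Illustrative only: invented server names and URLs, in the general style of
# existing MCP client configs (an "mcpServers" map of named server entries).
mcp_config = {
    "mcpServers": {
        "ci": {"url": "https://mcp.example-ci.dev/sse"},           # CI/CD runs
        "monitoring": {"url": "https://mcp.example-apm.dev/sse"},  # prod metrics
        "issues": {"url": "https://mcp.example-tracker.dev/sse"},  # bug tracker
    }
}

# An agent pursuing "reduce memory leaks by 20%" could pull leak telemetry
# from "monitoring" and track its fixes via "issues" -- context that a
# local-files-only tool never sees.
remote_names = sorted(mcp_config["mcpServers"])
print(remote_names)  # ['ci', 'issues', 'monitoring']
```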

&lt;p&gt;The honest caveat: the risk is that autonomous goal-pursuing agents introduce unpredictable changes, and trust will be the key barrier to adoption. None of the UI is visible yet, so the full scope remains unclear. There's a real question about what "approve the direction" actually looks like in practice when the agent is making dozens of decisions across a large codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and Access
&lt;/h2&gt;

&lt;p&gt;Project Jitro is still pre-launch. The upcoming experience is expected to launch under a waitlist, with Google I/O 2026 on May 19 as the likely announcement moment alongside broader Gemini ecosystem updates. The Jules team has published a waitlist page with messaging that reads: "Manually prompting your agents is so… 2025."&lt;/p&gt;

&lt;p&gt;Current &lt;a href="https://jules.google/" rel="noopener noreferrer"&gt;Jules&lt;/a&gt; users on Google AI Pro and Ultra are the most likely early access recipients. No public timeline beyond "2026" has been confirmed.&lt;/p&gt;




&lt;p&gt;The line between "AI that helps you code" and "AI that owns a development objective" is the line Jitro is trying to cross. Whether it lands or not at I/O, the framing alone forces every other coding tool to answer the same question: how long until your users stop writing prompts?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Anthropic's Most Dangerous Model Just Got Accessed by People Who Weren't Supposed to Have It</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Wed, 22 Apr 2026 01:28:17 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/anthropics-most-dangerous-model-just-got-accessed-by-people-who-werent-supposed-to-have-it-14dn</link>
      <guid>https://popcorn.forem.com/om_shree_0709/anthropics-most-dangerous-model-just-got-accessed-by-people-who-werent-supposed-to-have-it-14dn</guid>
      <description>&lt;p&gt;Anthropic built a model so dangerous they refused to release it publicly. Then a Discord group got in anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model They Wouldn't Ship
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;Claude Mythos Preview&lt;/a&gt; is Anthropic's most capable model to date for coding and agentic tasks. &lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; But it was never meant to reach the public. During testing, Mythos improved to the point where it mostly saturated existing cybersecurity benchmarks, prompting Anthropic to shift focus to novel real-world security tasks — specifically zero-day vulnerabilities, bugs that were not previously known to exist. &lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What they found was stark. Mythos Preview had already identified thousands of zero-day vulnerabilities across critical infrastructure — many of them high-severity — in every major operating system and every major web browser. &lt;a href="https://www.anthropic.com/glasswing" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; In one documented case, Mythos fully autonomously identified and exploited a 17-year-old remote code execution vulnerability in FreeBSD that allows anyone to gain root on a machine running NFS. No human was involved in either the discovery or exploitation of this vulnerability after the initial request to find the bug. &lt;a href="https://red.anthropic.com/2026/mythos-preview/" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why the model never went public.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Glasswing: The Controlled Release
&lt;/h2&gt;

&lt;p&gt;Announced on April 7, Mythos was deployed as part of Anthropic's "Project Glasswing," a controlled initiative under which select organizations are permitted to use the unreleased Claude Mythos Preview model for defensive cybersecurity. &lt;a href="https://www.yahoo.com/news/articles/anthropics-mythos-model-accessed-unauthorized-214920132.html" rel="noopener noreferrer"&gt;Yahoo!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Launch partners included Amazon Web Services, Anthropic, Apple, Broadcom, Cisco, CrowdStrike, Google, JPMorganChase, the Linux Foundation, Microsoft, NVIDIA, and Palo Alto Networks. Access was also extended to over 40 additional organizations that build or maintain critical software infrastructure. &lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt; The logic was clear: get defenders ahead of the curve before the capabilities proliferate to actors who won't use them carefully.&lt;/p&gt;

&lt;p&gt;Claude Mythos Preview is available to Project Glasswing participants at $25/$125 per million input/output tokens, accessible via the Claude API, Amazon Bedrock, Google Cloud's Vertex AI, and Microsoft Foundry. Anthropic committed $100M in model usage credits to cover Project Glasswing throughout the research preview. &lt;a href="https://www.anthropic.com/project/glasswing" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/p&gt;
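&lt;p&gt;Those per-token rates are easier to reason about as concrete job costs. A minimal sketch using the published $25/$125 pricing — the token counts are made-up example numbers, not real usage data:&lt;/p&gt;

```python
# Hedged sketch: illustrative cost math at the published Project Glasswing
# rates ($25 input / $125 output per million tokens). Token counts are invented.
INPUT_PRICE_PER_MTOK = 25.0    # USD per 1M input tokens
OUTPUT_PRICE_PER_MTOK = 125.0  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one API call at Glasswing pricing."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK \
        + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# A long agentic security-audit run: 400k tokens of code context in,
# 50k tokens of analysis out.
cost = request_cost(400_000, 50_000)
print(f"${cost:.2f}")  # $16.25
```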

&lt;p&gt;The perimeter was tight by design. The news today is that it didn't hold.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Discord Group Got In
&lt;/h2&gt;

&lt;p&gt;A "private online forum," the members of which have not been publicly identified, managed to gain access to the tool through a third-party vendor. The unauthorized group tried a number of different strategies to gain access to the model, including using "access" enjoyed by a person currently employed at a third-party contractor that works for Anthropic. &lt;a href="https://techcrunch.com/2026/04/21/unauthorized-group-has-gained-access-to-anthropics-exclusive-cyber-tool-mythos-report-claims/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Members of the group are part of a Discord channel that seeks out information about unreleased AI models. The group has been using Mythos regularly since gaining access to it, and provided evidence to Bloomberg in the form of screenshots and a live demonstration of the software. &lt;a href="https://techcrunch.com/2026/04/21/unauthorized-group-has-gained-access-to-anthropics-exclusive-cyber-tool-mythos-report-claims/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The method they used to find the endpoint is particularly revealing. The group, which gained access on the very same day Mythos was publicly announced, "made an educated guess about the model's online location based on knowledge about the format Anthropic has used for other models." &lt;a href="https://techcrunch.com/2026/04/21/unauthorized-group-has-gained-access-to-anthropics-exclusive-cyber-tool-mythos-report-claims/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt; This wasn't a sophisticated breach — it was pattern recognition applied to a known naming convention. The group reportedly described themselves as being interested in exploring new models, not causing harm.&lt;/p&gt;

&lt;p&gt;Anthropic said it is investigating the claims and, so far, has seen no sign that its own systems were affected — the allegation points to possible misuse of access outside Anthropic's core network, not a confirmed breach of the company's internal defenses. &lt;a href="https://www.prismnews.com/news/anthropic-probes-claims-of-unauthorized-access-to" rel="noopener noreferrer"&gt;Prism News&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than It Looks
&lt;/h2&gt;

&lt;p&gt;The immediate reassurance — no core systems compromised, the group wasn't malicious — is accurate but beside the point. The problem isn't what this specific group did. It's what this incident reveals about the entire premise of Project Glasswing.&lt;/p&gt;

&lt;p&gt;Anthropic's controlled release strategy rests on the assumption that access can be meaningfully gated through vendor relationships. A small group of unauthorized users reportedly accessed Mythos on the same day Anthropic announced limited testing &lt;a href="https://www.prismnews.com/news/anthropic-probes-claims-of-unauthorized-access-to" rel="noopener noreferrer"&gt;Prism News&lt;/a&gt; — meaning the access controls failed within hours of the first public announcement, before most Glasswing partners had even begun their work. If the group could guess the model's endpoint from Anthropic's known URL patterns, so can threat actors with more resources and worse intentions.&lt;/p&gt;

&lt;p&gt;There's also a pattern here worth naming. This is the third significant information control failure at Anthropic in recent weeks. The Claude Code source leak in March exposed 512,000 lines of unobfuscated TypeScript via a missing .npmignore entry. Before that, a draft blog post describing Mythos as "by far the most powerful AI model" ever built at Anthropic was left in a publicly accessible data store. That March 26 leak of draft materials — which Anthropic said resulted from human error in its content-management configuration — was actually Mythos's first public exposure. &lt;a href="https://www.prismnews.com/news/anthropic-probes-claims-of-unauthorized-access-to" rel="noopener noreferrer"&gt;Prism News&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then there's the government subplot. The National Security Agency is using Mythos Preview despite top officials at the Department of Defense — which oversees the NSA — insisting Anthropic is a "supply chain risk." The department moved in February to cut off Anthropic and force its vendors to follow suit. The military is now broadening its use of Anthropic's tools while simultaneously arguing in court that using those tools threatens U.S. national security. &lt;a href="https://www.axios.com/2026/04/19/nsa-anthropic-mythos-pentagon" rel="noopener noreferrer"&gt;Axios&lt;/a&gt; Meanwhile, CISA — the agency whose entire mandate is critical infrastructure protection — reportedly does not have access to the model. &lt;a href="https://www.axios.com/2026/04/21/cisa-anthropic-mythos-ai-security" rel="noopener noreferrer"&gt;Axios&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The entity designed to defend critical systems can't get in. A Discord group can.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Anthropic Actually Said
&lt;/h2&gt;

&lt;p&gt;"We're investigating a report claiming unauthorized access to Claude Mythos Preview through one of our third-party vendor environments," an Anthropic spokesperson said. The company found no evidence that the supposedly unauthorized activity impacted Anthropic's systems at all. &lt;a href="https://techcrunch.com/2026/04/21/unauthorized-group-has-gained-access-to-anthropics-exclusive-cyber-tool-mythos-report-claims/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's a factually careful statement. It's also a familiar shape: acknowledge the narrow, deny the broader implication. Anthropic has been here before.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vendor Problem Nobody Wants to Solve
&lt;/h2&gt;

&lt;p&gt;The deeper structural issue is that enterprise AI deployments at frontier capability levels require trust chains that extend across dozens of organizations. Anthropic's 40-organization Glasswing rollout means 40 distinct security postures, 40 sets of contractors, and 40 potential lateral entry points for anyone who knows what they're looking for.&lt;/p&gt;

&lt;p&gt;Anthropic said it does not plan to make Mythos Preview generally available, but its eventual goal is to enable users to safely deploy Mythos-class models at scale — for cybersecurity purposes, but also for the myriad other benefits that such highly capable models will bring. &lt;a href="https://simonwillison.net/2026/Apr/7/project-glasswing/" rel="noopener noreferrer"&gt;Simon Willison&lt;/a&gt; That goal is legitimate. But reaching it requires solving vendor access governance at a level the industry hasn't had to reckon with before. This incident is an early indication of what the stakes look like when the effort falls short.&lt;/p&gt;

&lt;p&gt;A model capable of finding zero-days in every major operating system and browser has now been accessed by people outside the intended perimeter. The question isn't whether the Discord group caused harm. It's whether the perimeter can hold when the people on the other side are actually trying.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The line between "interested in playing around" and "interested in breaking things" isn't enforced by intent. It's enforced by access controls. Anthropic's have now failed repeatedly in a matter of weeks.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>security</category>
      <category>discuss</category>
      <category>claude</category>
    </item>
    <item>
      <title>Anthropic Just Passed OpenAI in Revenue. Here's What Actually Built That Lead.</title>
      <dc:creator>Om Shree</dc:creator>
      <pubDate>Mon, 20 Apr 2026 06:46:06 +0000</pubDate>
      <link>https://popcorn.forem.com/om_shree_0709/anthropic-just-passed-openai-in-revenue-heres-what-actually-built-that-lead-2kmo</link>
      <guid>https://popcorn.forem.com/om_shree_0709/anthropic-just-passed-openai-in-revenue-heres-what-actually-built-that-lead-2kmo</guid>
      <description>&lt;p&gt;A year ago, the consensus was that OpenAI had an insurmountable lead. The brand. The user base. ChatGPT with hundreds of millions of users. The head start. In April 2026, Anthropic crossed $30 billion in annualized revenue and left OpenAI's $25 billion behind — the first time any rival has led this race since ChatGPT launched in November 2022.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Number That Shocked Even the Analysts
&lt;/h2&gt;

&lt;p&gt;Anthropic's annualized revenue run-rate hit $30 billion in April 2026, officially overtaking OpenAI's $25 billion — the first time any rival has surpassed OpenAI since ChatGPT launched in 2022. &lt;a href="https://vucense.com/ai-intelligence/industry-business/anthropic-overtakes-openai-30-billion-arr-2026/" rel="noopener noreferrer"&gt;Vucense&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Epoch AI had modeled it. Analysts debated the timing. Even under the most optimistic assessments, it wasn't supposed to happen until August 2026. It happened in April. &lt;a href="https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/" rel="noopener noreferrer"&gt;SaaStr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trajectory itself is the story. Anthropic went from $87 million run-rate in January 2024, to $1 billion by December 2024, to $9 billion by end of 2025, to $14 billion in February 2026, to $19 billion in March, to $30 billion in April. That last sequence — $14B to $30B in roughly 8 weeks — is hard to make sense of in traditional software terms. &lt;a href="https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/" rel="noopener noreferrer"&gt;SaaStr&lt;/a&gt;&lt;/p&gt;
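&lt;p&gt;One way to make sense of it: convert the run-rate sequence into implied month-over-month growth. The inputs are the article's numbers; the derived rates are our own arithmetic, not sourced figures.&lt;/p&gt;

```python
# Derive implied monthly growth from the run-rate figures quoted above.
run_rate = {          # annualized revenue run-rate, USD billions
    "Jan 2024": 0.087,
    "Dec 2024": 1.0,
    "Dec 2025": 9.0,
    "Feb 2026": 14.0,
    "Mar 2026": 19.0,
    "Apr 2026": 30.0,
}

# Implied average month-over-month growth for the Feb -> Apr 2026 jump.
feb_to_apr = (run_rate["Apr 2026"] / run_rate["Feb 2026"]) ** (1 / 2) - 1
print(f"Feb->Apr implied monthly growth: {feb_to_apr:.0%}")  # ~46%

# And over the full 27-month window from Jan 2024 to Apr 2026.
overall = (run_rate["Apr 2026"] / run_rate["Jan 2024"]) ** (1 / 27) - 1
print(f"Jan 2024 -> Apr 2026 implied monthly growth: {overall:.0%}")  # ~24%
```

&lt;p&gt;Sustaining roughly 46% month-over-month growth at a $14B base is what makes the sequence so hard to fit into traditional software comparisons.&lt;/p&gt;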

&lt;p&gt;For context: Salesforce took about 20 years to reach $30 billion in annual revenue. Anthropic did it in under 3 years from a standing start. &lt;a href="https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/" rel="noopener noreferrer"&gt;SaaStr&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Enterprise Bet That Everyone Underestimated
&lt;/h2&gt;

&lt;p&gt;OpenAI's revenue composition is more consumer-heavy, with ChatGPT Plus and Pro subscriptions making up a large share. Anthropic's composition runs roughly 80% enterprise — higher retention, lower churn, and contracts that expand over time rather than cancelling when novelty fades. &lt;a href="https://www.roborhythms.com/anthropic-revenue-30-billion-2026/" rel="noopener noreferrer"&gt;Robo Rhythms&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The customer numbers make this concrete. Enterprise customers spending over $1 million annually doubled to 1,000+ in under two months. Eight of the Fortune 10 are Anthropic customers. &lt;a href="https://vucense.com/ai-intelligence/industry-business/anthropic-overtakes-openai-30-billion-arr-2026/" rel="noopener noreferrer"&gt;Vucense&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the enterprise LLM API market, Anthropic accounts for 32% compared to OpenAI's 25%. Seven out of every ten new enterprise customers choose Anthropic. &lt;a href="https://www.tradingkey.com/analysis/stocks/us-stocks/261756528-anthropic-openai-ipo-tradingkey" rel="noopener noreferrer"&gt;Tradingkey&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enterprise buyers treat a large funding round as a signal of platform stability. Companies that had been hesitant to commit multi-year API contracts moved forward after Anthropic's February 2026 Series G because Anthropic looked like it was in the race to stay. The doubling of $1M+ clients in under two months right after the Series G confirms that signal-driven buying happened at scale. &lt;a href="https://www.roborhythms.com/anthropic-revenue-30-billion-2026/" rel="noopener noreferrer"&gt;Robo Rhythms&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code: The Single Product That Changed Everything
&lt;/h2&gt;

&lt;p&gt;None of this happens without &lt;a href="https://claude.ai/code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt;. Launched in May 2025, Claude Code reached $1 billion in annualized revenue by November and surpassed $2.5 billion by February 2026 — a product growing from zero to $2.5 billion in nine months. A review of SaaS industry history turns up no faster case. &lt;a href="https://www.kucoin.com/news/flash/anthropic-surpasses-openai-in-revenue-and-market-share" rel="noopener noreferrer"&gt;KuCoin&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Business subscriptions to Claude Code have quadrupled since the start of 2026, and enterprise use has grown to represent over half of all Claude Code revenue. &lt;a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Claude Code holds a 54% market share in the AI programming tool segment — far exceeding GitHub Copilot and Cursor. &lt;a href="https://www.tradingkey.com/analysis/stocks/us-stocks/261756528-anthropic-openai-ipo-tradingkey" rel="noopener noreferrer"&gt;Tradingkey&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason enterprises pay for it is structural, not incremental. GitHub Copilot helps you complete the next line as you write code — you're still the one doing the work. Claude Code doesn't just autocomplete; it handles entire workflows. &lt;a href="https://www.kucoin.com/news/flash/anthropic-surpasses-openai-in-revenue-and-market-share" rel="noopener noreferrer"&gt;KuCoin&lt;/a&gt; That's the difference between a feature and a budget line replacement.&lt;/p&gt;

&lt;p&gt;And Claude Code is available on every major surface. Claude is the only frontier AI model available on all three of the world's largest cloud platforms: AWS Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry. &lt;a href="https://www.the-ai-corner.com/p/anthropic-30b-arr-passed-openai-revenue-2026" rel="noopener noreferrer"&gt;The-ai-corner&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Cost Gap Nobody Is Talking About Enough
&lt;/h2&gt;

&lt;p&gt;Revenue is the headline. The cost structure is the real story.&lt;/p&gt;

&lt;p&gt;OpenAI is projected to spend $125 billion per year on training by 2030. Anthropic's projection for the same period: around $30 billion. Same race. 4x difference in cost. &lt;a href="https://www.the-ai-corner.com/p/anthropic-30b-arr-passed-openai-revenue-2026" rel="noopener noreferrer"&gt;The-ai-corner&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;OpenAI is burning approximately $17 billion in cash this year. Internal documents project a $14 billion loss for 2026. The company does not project positive free cash flow until 2029. Anthropic projects positive free cash flow by 2027 — three years ahead of its main competitor, while generating more revenue. &lt;a href="https://www.saastr.com/anthropic-just-passed-openai-in-revenue-while-spending-4x-less-to-train-their-models/" rel="noopener noreferrer"&gt;SaaStr&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A new agreement with Google and Broadcom will deliver approximately 3.5 gigawatts of next-generation TPU capacity starting in 2027. Rather than relying solely on Nvidia GPUs, Anthropic is diversifying across Google TPUs, AWS Trainium chips, and Nvidia hardware — matching workloads to the chips best suited for them. &lt;a href="https://medium.com/@david.j.sea/anthropic-just-passed-openai-in-revenue-here-is-why-it-matters-e3dd9bb04069" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic is investing its revenue advantage into infrastructure before it needs it. That's a different kind of discipline than raising $120 billion and spending it on training runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Is a Bigger Deal Than a Revenue Chart
&lt;/h2&gt;

&lt;p&gt;The revenue story is inseparable from Anthropic's deliberate choice to prioritise enterprise over consumers. The $30B ARR is earned by being useful to businesses, not by harvesting user attention. &lt;a href="https://vucense.com/ai-intelligence/industry-business/anthropic-overtakes-openai-30-billion-arr-2026/" rel="noopener noreferrer"&gt;Vucense&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Pentagon labelled Anthropic a supply chain risk for refusing to arm autonomous weapons with Claude. Revenue accelerated anyway — from $19B to $30B ARR in the weeks after that clash became public. The enterprise customer base that drives Anthropic's revenue appears to have either ignored or positively responded to Anthropic's refusal to compromise. &lt;a href="https://vucense.com/ai-intelligence/industry-business/anthropic-overtakes-openai-30-billion-arr-2026/" rel="noopener noreferrer"&gt;Vucense&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One caveat worth stating plainly: OpenAI has argued that Anthropic uses gross revenue accounting for its deals with Amazon and Google, which inflates top-line figures. The real net figure, by OpenAI's accounting, would be lower. &lt;a href="https://gardenzhome.com/anthropic-revenue-breakdown-openai/" rel="noopener noreferrer"&gt;Gardenzhome&lt;/a&gt; That dispute isn't settled. But even accounting for it, the trajectory is real, the enterprise customer count is real, and the Claude Code numbers are real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Availability and What It Means for Developers
&lt;/h2&gt;

&lt;p&gt;Anthropic operates its models on a diversified range of AI hardware — AWS Trainium, Google TPUs, and NVIDIA GPUs — which means it can match workloads to the chips best suited for them. &lt;a href="https://www.anthropic.com/news/anthropic-raises-30-billion-series-g-funding-380-billion-post-money-valuation" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IPO question is now live. Anthropic is targeting October 2026, aiming to raise $60B+ at a $380B valuation. No S-1 has been filed. The timeline is subject to market conditions and the SEC's review of accounting methodology questions. &lt;a href="https://vucense.com/ai-intelligence/industry-business/anthropic-overtakes-openai-30-billion-arr-2026/" rel="noopener noreferrer"&gt;Vucense&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic's $30 billion run rate exceeds the trailing twelve-month revenues of all but approximately 130 S&amp;amp;P 500 companies. A company that was essentially pre-revenue in early 2024 now out-earns most of the Fortune 500. &lt;a href="https://medium.com/@david.j.sea/anthropic-just-passed-openai-in-revenue-here-is-why-it-matters-e3dd9bb04069" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The company that left OpenAI to build AI more carefully just built a bigger business doing it. That's not an accident — it's a thesis proving out in real time. The question now isn't whether Anthropic belongs in the same conversation as OpenAI. It's whether the enterprise-first, developer-first model it validated is the one the rest of the industry will be chasing for the next decade.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow for more coverage on MCP, agentic AI, and AI infrastructure.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>productivity</category>
      <category>security</category>
      <category>discuss</category>
    </item>
  </channel>
</rss>
