Google shipped Gemini 3.1 Pro yesterday. It scores 77.1% on ARC-AGI-2, more than double its predecessor’s score. It costs $2 per million input tokens, less than half the price of Claude Opus 4.6. And the demo that went viral wasn’t a benchmark chart. It was a pelican riding a bicycle.
An animated SVG. Pure code. The model reasoned about anatomy, physics, and animation timing for over five minutes, then produced a vector graphic where the pelican actually pedals. Jeff Dean’s tweet showed a full menagerie: a frog on a penny-farthing, a turtle kickflipping a skateboard, a dachshund driving a stretch limousine.
It’s a neat trick. But the interesting thing isn’t what Gemini can draw. It’s what developers are doing with the capability.
The Slot Machine
Scroll through any developer community right now and you’ll find the same pattern repeated with minor variations:
> “I use Claude for analysis, GPT for structuring, Perplexity for research, Gemini for multimodal, local tools for memory.” (@alex_prompter on X)
This isn’t one person’s quirky setup. It’s the emerging default. The thread had 130 likes and 27 retweets because developers saw their own workflow reflected back at them.
The combinations shift, but the structure is consistent:
- Claude for coding and analysis: Backend logic, debugging, architecture, refactoring. The consensus pick for tasks where code quality and reasoning depth matter most.
- Gemini for visual and multimodal: UI generation, screenshot analysis, image understanding, and now animated SVG. Its 1M token context window makes it the default for large-document ingestion.
- GPT for planning, structuring, and review: System design, documentation, outlining, code review. The model developers reach for when they need to organize before they build, or sanity-check after.
- Grok/Perplexity for research: Real-time data, web search, fact-checking. The “what’s happening right now” layer.
According to recent surveys, 84% of developers now use AI tools. But increasingly, “AI tools” is plural. Developers aren’t choosing a model. They’re choosing a stack.
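Written down as code rather than habit, the stack is almost embarrassingly simple. The routing table below is a sketch of the pattern in that thread; the task categories and model assignments are illustrative, not recommendations:

```typescript
// The "polyglot stack" as an explicit routing table.
// Task categories and model assignments are illustrative only.
type TaskKind = "code" | "visual" | "planning" | "research";

const stack: Record<TaskKind, string> = {
  code: "claude",         // backend logic, debugging, refactoring
  visual: "gemini",       // screenshots, UI generation, animated SVG
  planning: "gpt",        // design docs, outlines, code review
  research: "perplexity", // real-time data, fact-checking
};

// In most workflows the "router" is just the developer's judgment;
// written down, it is a one-line lookup.
function route(task: TaskKind): string {
  return stack[task];
}

console.log(route("visual")); // -> "gemini"
```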
Code as Visual Representation
Kevin Lin, a researcher at Google, posted a thread that reframes why the SVG moment matters beyond the obvious wow factor. His argument: code is becoming a visual medium.
Here’s why that’s a bigger deal than it sounds. LLMs are notoriously bad at spatial reasoning. Research consistently shows that when you ask state-of-the-art vision-language models whether one object is to the left or right of another, they perform near random chance. GPT-4o scores 27.5% on questions about human viewpoint perspective. LRR-Bench (literally “Left, Right or Rotate?”) found that spatial reasoning tasks actively induce hallucinations in some models. These systems can write a thousand-line algorithm but can’t reliably tell you which side of the table the cup is on.
To generate an animated SVG, a model has to do exactly what LLMs are worst at. It has to imagine a scene it’s never seen, reason about spatial relationships and physics, maintain consistent left/right orientation across frames, and express all of that as abstract symbolic code that renders into something meaningful. That’s a multimodal reasoning task disguised as a coding task, and it targets the precise capability gap that has plagued every frontier model.
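To make that concrete, here is a hand-written miniature of the kind of artifact the task demands: a single crank that actually rotates around its hub, with the motion expressed purely as declarative markup. This is my sketch, not Gemini’s pelican, which is far more elaborate:

```typescript
// A minimal "animated SVG as pure code" example: a crank arm and pedal
// rotating around the hub of a fixed wheel. Keeping the rotation centered
// on the hub is exactly the kind of spatial constraint described above.
const crankSvg = `
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <!-- fixed wheel -->
  <circle cx="50" cy="60" r="25" fill="none" stroke="black" stroke-width="2"/>
  <!-- crank arm + pedal, rotating around the hub at (50, 60) -->
  <g>
    <line x1="50" y1="60" x2="50" y2="40" stroke="black" stroke-width="3"/>
    <rect x="44" y="36" width="12" height="4" fill="black"/>
    <animateTransform attributeName="transform" type="rotate"
      from="0 50 60" to="360 50 60" dur="2s" repeatCount="indefinite"/>
  </g>
</svg>`;

// In a browser, the markup can be dropped straight into the page:
// document.body.insertAdjacentHTML("beforeend", crankSvg);
```

Get the rotation center wrong by a few units and the pedal orbits the frame instead of the hub. That is the failure mode the spatial-reasoning benchmarks keep finding, made visible.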
And SVG is just the starting point. The same capability extends to 3D scenes via Blender, interactive web experiences via HTML/CSS/JS, and mathematical visualizations in the style of 3Blue1Brown. One developer got Gemini 3.1 Pro to build a real-time ISS orbital tracking dashboard. Another got an interactive 3D voxel world running in the browser from a single prompt.
I built Squigglify on top of Gemini 3 Pro for exactly this reason. Kids draw a squiggle, Gemini interprets the strokes spatially and completes them into an animated illustration. It’s a Mr. Squiggle for the AI era. The whole product only works because Gemini can look at a rough set of coordinates on a canvas, reason about what the shape could become, and generate coherent line art that incorporates the original strokes. That’s spatial reasoning under real constraints, not a benchmark prompt.
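Squigglify’s production code isn’t the point here, but the core move is easy to sketch: serialize the strokes, ask for an SVG that keeps them where the kid drew them, and render whatever comes back. The types and prompt wording below are illustrative placeholders, not the real implementation:

```typescript
// Illustrative sketch of the squiggle-to-illustration pattern.
// Types and prompt wording are placeholders, not Squigglify's actual code.
type Point = { x: number; y: number };
type Stroke = Point[];

function strokesToPrompt(strokes: Stroke[], width: number, height: number): string {
  const serialized = strokes
    .map((stroke) => stroke.map((p) => `${p.x},${p.y}`).join(" "))
    .join("\n");
  return [
    `Freehand strokes on a ${width}x${height} canvas, one stroke per line, as x,y pairs:`,
    serialized,
    `Decide what the shape could become and return a single animated SVG with`,
    `viewBox="0 0 ${width} ${height}" that incorporates the original strokes`,
    `at their original coordinates. Return only the SVG markup.`,
  ].join("\n");
}

// Example: two rough strokes captured from a canvas.
const prompt = strokesToPrompt(
  [
    [{ x: 20, y: 80 }, { x: 60, y: 30 }, { x: 110, y: 75 }],
    [{ x: 60, y: 30 }, { x: 60, y: 90 }],
  ],
  200,
  120
);
console.log(prompt);
// The prompt then goes to whichever model fills the visual slot
// (Gemini, in Squigglify's case) through its normal text-generation API.
```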
This is why the animated pelican matters more than the benchmark scores. ARC-AGI-2 tells you a model can solve abstract puzzles. A pelican that actually pedals in the right direction, with legs on the correct side of the frame, tells you something about spatial intelligence that no multiple-choice benchmark captures.
The Familiar Architecture
If you squint, the polyglot AI stack maps cleanly to traditional software architecture:
- Claude is your backend: The heavy lifting. Complex logic, state management, the parts where correctness matters more than speed.
- Gemini is your frontend: The visual layer. Screenshots, UI, the parts where seeing matters more than reasoning in text.
- GPT is your middleware: The orchestration. Planning, routing, structuring. The glue that connects intent to implementation.
This isn’t a coincidence. Developers are applying the same architectural instinct they’ve always had: decompose the problem, route each piece to the tool that handles it best, compose the results.
The difference is that the “tools” are now foundation models, and the “routing” is often just the developer’s judgment about which CLI tool to open.
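Spelled out as code, decompose-route-compose is a short pipeline. The `ask` helper below is a stand-in for whichever SDK or CLI each model actually sits behind; the prompts and model names are illustrative:

```typescript
// Decompose -> route -> compose, with a placeholder client.
type Model = "gpt" | "claude" | "gemini";

// Stand-in for each vendor's API or CLI; returns a canned string here.
async function ask(model: Model, prompt: string): Promise<string> {
  return `[${model}] response to: ${prompt.slice(0, 40)}...`;
}

async function buildFeature(requirement: string): Promise<string> {
  // Middleware: turn intent into a plan.
  const plan = await ask("gpt", `Write an implementation plan for: ${requirement}`);
  // Backend: the heavy lifting, driven by that plan.
  const code = await ask("claude", `Implement this plan:\n${plan}`);
  // Frontend: the visual layer for the same plan.
  const ui = await ask("gemini", `Generate the UI (HTML/SVG) for:\n${plan}`);
  // Compose the results.
  return `${code}\n\n${ui}`;
}

buildFeature("a real-time ISS orbital tracking dashboard").then(console.log);
```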
What Gemini 3.1 Pro Actually Changes
Beyond the SVG demos, the model represents a meaningful step in a few areas:
- Reasoning at half the price: Near-Opus benchmark scores at Gemini 3 pricing. SWE-Bench Verified at 80.6% (Claude is at roughly 80.8%). The gap between “premium” and “affordable” reasoning is collapsing.
- Native multimodal: Text, image, audio, video from the ground up. Claude and GPT still treat multimodal as an extension of text. Gemini treats it as a first-class input.
- Creative coding as benchmark: The SVG generation isn’t a gimmick. It’s becoming a legitimate way to evaluate how well a model understands spatial reasoning, physics, and composition. Expect more “draw me X” benchmarks.
The model is painfully slow at launch: Simon Willison reported 323 seconds for the pelican SVG, and a simple “hi” took 104 seconds. Outputs are inconsistent; the same prompt can produce very different quality. And HN commenters note it takes shortcuts when tasks have more than five constraints. This is a preview, not production-ready.
The Risks of Going Polyglot
More models means more complexity. And complexity has costs:
- Context fragmentation: Your conversation history lives in four different systems. No model has the full picture. You become the integration layer, manually copying context between tools.
- Moving targets: Today’s specializations aren’t stable. Claude’s multimodal capabilities are improving. Gemini’s code quality is catching up. The slot each model fills could shift with the next release.
- Vendor surface area: More API keys, more billing, more rate limits, more breaking changes. Every model you depend on is a dependency you maintain.
- The “good enough” trap: Sometimes a single model that’s 80% as good at everything beats a stack of specialists. The overhead of routing and context-switching can eat the gains from specialization.
The developers who are making the polyglot stack work tend to have clear boundaries: Claude Code for the development session, Gemini for visual tasks that arrive mid-flow, GPT for the initial planning doc. The ones who struggle are trying to use all four models within a single task, losing context at every handoff.
Where This Goes
The polyglot model stack is a transitional pattern. It works because models have distinct strengths and developers are willing to manage the complexity manually. Both of those conditions are temporary.
Models are converging. Gemini’s code quality improves with each release. Claude’s multimodal is expanding. GPT keeps absorbing capabilities from specialized tools. The specialization gaps that make the polyglot stack worthwhile are narrowing.
At the same time, tooling is catching up. I’ve been living this pattern for months via a plugin marketplace I built for Claude Code. /gemini:visual for screenshots and mockups, /codex:review for architecture and code review, Claude for everything else. I don’t think about which model to use anymore. The slash commands handle the routing. Multi-model orchestration is already here for anyone willing to wire it up. The question is when it becomes the default.
Right now, in February 2026, a pelican is riding a bicycle in pure SVG, and developers are calmly slotting it into the visual layer of their AI stack while Claude handles the backend and GPT writes the design doc.
Nobody picked a winner. They picked a team.


