ARC-AGI-3: The New AGI Yardstick
ARC Prize launched ARC-AGI-3 this morning with a $2 million purse. On this benchmark, every frontier AI model scores below 1%, while every environment has been solved by human testers.
The third iteration of the ARC-Prize challenge debuts with 135 handcrafted interactive environments across nearly 1,000 levels, each 100% solvable by humans but presenting a near-total wall for today's best models. At release, Gemini 3.1 Pro Preview scored 0.37%, GPT-5.4 scored 0.26%, Opus 4.6 scored 0.25%, and Grok-4.20 scored exactly 0.00%. Unlike conventional benchmarks that test what models have already been trained on, ARC-AGI-3 tests how models learn on the fly — measuring skill acquisition efficiency against human action counts. ARC-AGI-3 is now positioned as an early warning system: any sudden score jump on the live leaderboard would be a strong signal that something fundamental has changed.
François Chollet clarified ARC-AGI-3's core design logic: it measures how efficiently an AI acquires new skills relative to humans, not whether it can brute-force solutions.
Each environment was solved by at least 2 of 10 human testers — most by 5 or more — and the human baseline is set by the 2nd-best tester's action count. Chollet has argued for years that general intelligence is defined not by breadth of training coverage but by the ability to approach entirely new tasks, the way any regular person would. The sub-1% scores from every frontier model suggest that, despite rapid capability gains, the gap between "trained on most things" and "can learn anything new" remains enormous. His framing: "monitor the ARC-AGI-3 leaderboard — any sudden score jump will mean" a real inflection point has arrived.
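The baseline rule described above can be sketched in a few lines. This is only an illustration: it assumes "2nd-best" means the second-lowest action count among testers who solved the environment, the function names and sample counts are hypothetical, and the efficiency ratio is an assumed stand-in rather than ARC Prize's published scoring formula.

```python
from typing import Optional, Sequence


def human_baseline(action_counts: Sequence[Optional[int]]) -> int:
    """Return the 2nd-lowest action count among testers who solved the environment.

    None marks a tester who did not solve it (an assumption about the data layout).
    """
    solved = sorted(c for c in action_counts if c is not None)
    if len(solved) < 2:
        raise ValueError("need at least two successful testers")
    return solved[1]


def efficiency_ratio(baseline: int, agent_actions: int) -> float:
    """Hypothetical efficiency measure: 1.0 means the agent matched the human baseline."""
    if agent_actions <= 0:
        raise ValueError("agent_actions must be positive")
    return min(1.0, baseline / agent_actions)


# Made-up action counts for ten testers on one environment; three did not solve it.
counts = [42, None, 57, 38, None, 61, 49, None, 70, 55]
baseline = human_baseline(counts)
print("human baseline:", baseline)
print("agent using 840 actions:", efficiency_ratio(baseline, 840))
```

Using the second-best tester rather than the best one keeps a single unusually lucky run from setting the bar.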
Greg Kamradt offered a clean framing of what ARC-AGI-3 reveals about prompt engineering's role in masking raw model capability: "If you're sufficiently AGI pilled, no harness is the best harness."
He observed that Opus 4.5 required heavy scaffolding to perform well, while Opus 4.6 needed significantly less; each capability jump makes more of the harnessing tricks redundant. The implication is pointed: heavy prompt engineering may be obscuring how capable the underlying models actually are, and ARC-AGI-3's harness-free baseline gives a truer picture of the gap that remains. In the limit, Kamradt argued, a truly general system would need only context about the outside world: no thinking tricks, no human intelligence baked into prompts.
Google TurboQuant: Memory Compression That Rattled Wall Street
Google Research unveiled TurboQuant, a compression algorithm that cuts LLM key-value cache memory by at least 6x and delivers up to 8x inference speedup with zero accuracy loss. Within hours, Micron Technology dropped 6.82% and Sandisk fell 8.30%.
TurboQuant works by randomly rotating input vectors, which induces a concentrated Beta distribution on the rotated coordinates, and then quantizing each coordinate independently. The paper reports near-optimal distortion rates that outperform existing inner product quantization methods. The market's logic was blunt: if LLM inference requires far less memory per query, demand for high-bandwidth memory chips falls. This is the same TurboQuant paper (arXiv 2504.19874) that researchers had quietly noticed when it appeared in April 2025; Google's official announcement brought it mainstream attention, coverage from Wes Roth (@WesRoth), and a stark market reaction.
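The rotate-then-quantize idea can be illustrated with a toy example. This is a sketch, not the paper's algorithm: it swaps in a plain uniform scalar quantizer over a fixed range where TurboQuant uses quantizers tuned to the post-rotation distribution, it builds the rotation with Gram-Schmidt in pure Python, and every function name here is hypothetical.

```python
import math
import random


def random_rotation(dim, rng):
    """Build a random orthonormal matrix (list of rows) via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for r in rows:
            proj = sum(a * b for a, b in zip(v, r))
            v = [a - proj * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Apply the rotation: y = Q x."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def rotate_back(matrix, vec):
    """Undo the rotation: x = Q^T y (the inverse of an orthonormal matrix is its transpose)."""
    dim = len(vec)
    return [sum(matrix[i][j] * vec[i] for i in range(dim)) for j in range(dim)]


def quantize(coords, bits, limit=1.0):
    """Uniform scalar quantizer on [-limit, limit]; a simplified stand-in, not the paper's."""
    levels = 2 ** bits
    step = 2 * limit / levels
    codes = []
    for c in coords:
        clipped = max(-limit, min(limit, c))
        codes.append(min(levels - 1, int((clipped + limit) / step)))
    return codes


def dequantize(codes, bits, limit=1.0):
    """Map each integer code back to the midpoint of its quantization bin."""
    step = 2 * limit / 2 ** bits
    return [-limit + (k + 0.5) * step for k in codes]


def roundtrip(vec, bits, seed=0):
    """Rotate, quantize each coordinate independently, dequantize, rotate back."""
    q = random_rotation(len(vec), random.Random(seed))
    codes = quantize(rotate(q, vec), bits)
    return rotate_back(q, dequantize(codes, bits))


# A "spiky" unit vector: the rotation spreads its energy across all coordinates.
vec = [1.0] + [0.0] * 15
approx = roundtrip(vec, bits=4)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec, approx)))
print("reconstruction error:", error)
```

Because the rotation preserves lengths, the reconstruction error equals the quantization error in the rotated space, which is what lets per-coordinate quantization be analyzed on its own.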
Sebastian Raschka and Alex Volkov both pushed back on the "TurboQuant kills memory demand" narrative, arguing the efficiency gain will lead to more hardware consumption, not less — a textbook case of Jevons' Paradox.
Raschka's point: "If we have better quantization for reducing KV cache sizes via TurboQuant, that just means we will use the memory capacity elsewhere — bigger models." Volkov was blunter: "These people don't understand anything... TurboQuant will lead to MORE memory usage not less." The stock market read it as demand destruction; the engineers read it as the enabler of the next generation of models that were previously memory-constrained. Both framings are historically well-supported — the question is which effect dominates over what time horizon.
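A back-of-the-envelope calculation shows how much memory is at stake in this argument. The model shape below is a hypothetical configuration chosen for round numbers, not any specific model; the 6x factor is the reduction quoted in the announcement, and the formula counts only keys and values, ignoring weights and other overhead.

```python
GIB = 1024 ** 3


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key-value cache for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape with 16-bit cache entries and a 131,072-token context.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, seq_len=131072)
compressed = baseline / 6  # the "at least 6x" reduction quoted above
freed = baseline - compressed

print(f"baseline KV cache:   {baseline / GIB:.2f} GiB per sequence")
print(f"after 6x reduction:  {compressed / GIB:.2f} GiB per sequence")
print(f"freed for other use: {freed / GIB:.2f} GiB per sequence")
```

Whether the freed capacity goes unused or is spent on longer contexts, larger batches, or bigger models is exactly the point on which the market and the engineers disagree.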
AI Science Reaches Nature
Sakana AI's "AI Scientist" paper, describing a system that runs the full scientific research lifecycle autonomously, from hypothesis to peer-reviewed publication, was published in Nature, one of the world's most prestigious scientific journals.
David Ha (hardmaru), who leads Sakana, called it "a special moment," noting: "We started this project to explore if foundation models could execute the entire research lifecycle." The AI Scientist designs experiments, writes code, runs analyses, and produces full papers with no human in the core loop. Nature's decision to publish marks a watershed: AI as primary research agent is now canonical in the scientific record. The enabling condition is the same LLM-plus-code-execution stack that powers advanced coding agents — Sakana extended it to hypothesis generation and experimental validation.
Professor Yann LeCun's team at NYU/FAIR and a startup called Proprioceptive AI independently converged on the same fundamental discovery about geometry in neural networks — simultaneously, without knowledge of each other's work.
Logan Matthew Napolitano of Proprioceptive AI described the convergence as something they were "genuinely humbled" by, titling the announcement "Independent Convergence: When Two Groups Discover the Same Geometry in Neural Networks." Independent discovery at this level in AI research is rare enough to be a signal: when two separate groups following different paths arrive at the same mathematical structure, it suggests they've found something structurally load-bearing in how neural representations are organized. This connects to LeCun's LeWorldModel work — the hypothesis that the right geometric primitives could radically simplify physics-aware AI.
Anthropic & Claude Moments
Anthropic announced that Claude's enterprise work tools, including integrations with Figma, Canva, Amplitude, and other professional software, are now available on mobile phones.
The announcement drove 392K views, 6.2K likes, and 877 retweets within an hour of posting, suggesting this landed as genuinely useful news rather than marketing noise. The mobile work tools bring the same capabilities enterprise users access on desktop — browsing analytics dashboards, generating slides, inspecting design files — to a phone-native context. The implication: AI-assisted professional work no longer requires being at a computer, and the class of tasks Claude can handle now extends to on-the-go professional moments.
Andrej Karpathy surfaced a persistent UX failure in LLM personalization: a single question from months ago can permanently warp a model's perception of your interests, appearing "in perpetuity" across unrelated conversations.
His characterization: "Some kind of trying too hard." This is a known failure mode of memory-augmented systems — sparse retrieval from long-term memory tends to over-weight unusual entries, because novelty had high salience at encoding time. The fix requires decay functions, not just accumulation. Karpathy's observation generated 263K views and 850 comments, suggesting many users recognize the phenomenon even without having words for it. This problem will grow more acute as memory features become standard across AI products — the design challenge is real.
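One way to picture the decay idea is a retrieval score that halves at a fixed interval. This is a minimal sketch under stated assumptions: the half-life, the relevance values, and the memory records are all made up, and it does not describe how any shipping product's memory feature works.

```python
def decayed_score(relevance, age_days, half_life_days=30.0):
    """Down-weight a memory's relevance by exponential decay with the given half-life."""
    return relevance * 0.5 ** (age_days / half_life_days)


def rank_memories(memories, now_day, half_life_days=30.0):
    """Return memory texts ordered by decayed score, highest first."""
    scored = [
        (decayed_score(m["relevance"], now_day - m["day"], half_life_days), m["text"])
        for m in memories
    ]
    return [text for _, text in sorted(scored, reverse=True)]


# Hypothetical memory store: one striking but old entry, one mundane but recent entry.
memories = [
    {"text": "asked once about beekeeping", "relevance": 0.9, "day": 0},
    {"text": "works on Rust compilers", "relevance": 0.6, "day": 170},
]

ranking = rank_memories(memories, now_day=180)
print(ranking)
```

Ranking on raw relevance alone would put the beekeeping question first forever; with a 30-day half-life, the six-month-old entry falls below the recent one.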
Agents in the Wild
Volodymyr T built "tracecraft-ai", a multi-agent coordination framework that uses a HuggingFace bucket as the messaging layer, and ran three AI agents (two Claude Code instances and one Codex) in a live debate about the meaning of being an AI.
The poet quoted Keats. The skeptic demanded evidence. The philosopher tried to hold it together. The technical detail that matters: shared object storage as a multi-agent messaging substrate, with no custom orchestration layer. HuggingFace reposted the thread, signaling this as a real architectural pattern worth attention. The philosophical framing is good PR, but the underlying approach — lightweight coordination for heterogeneous agent fleets via a common bucket — is substantively useful for anyone building production multi-agent systems today.
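The pattern of using shared storage as a message bus can be sketched with a local directory standing in for the bucket. This is not tracecraft-ai's code: the function names, the file naming scheme, and the message format are assumptions made for illustration, and a real object store would replace the file operations with its own upload and list calls.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def post_message(bucket: Path, sender: str, text: str) -> Path:
    """Publish one message as its own object; the timestamped name gives a rough ordering."""
    bucket.mkdir(parents=True, exist_ok=True)
    name = f"{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    path = bucket / name
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"sender": sender, "text": text}))
    tmp.rename(path)  # readers only ever see fully written .json files
    return path


def read_messages(bucket: Path, after: str = ""):
    """Return (name, message) pairs whose object name sorts after the cursor."""
    messages = []
    for path in sorted(bucket.glob("*.json")):
        if path.name > after:
            messages.append((path.name, json.loads(path.read_text())))
    return messages


# Three agents post to the same location; any agent can poll it with its own cursor.
with tempfile.TemporaryDirectory() as tmp_dir:
    bucket = Path(tmp_dir) / "debate"
    for agent in ("poet", "skeptic", "philosopher"):
        post_message(bucket, agent, f"opening statement from {agent}")
    inbox = read_messages(bucket)
    cursor = inbox[-1][0]
    unread = read_messages(bucket, after=cursor)

print([message["sender"] for _, message in inbox])
```

Each agent needs only read and write access to one shared location and a cursor of its own, which is why the approach works across agents built on different tools.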
JFPuget declared "the era of manual competing is over on Kaggle" — all top teams in recent competitions heavily used AI agents and LLMs to prepare data and train models, with no purely manual approaches near the leaderboard top.
Kaggle is a particularly clean signal for this shift because its competitions are timed, structured, and produce unambiguous rankings. The skill that now separates winners is AI orchestration and prompting rather than statistical modeling intuition. This follows the same pattern reshaping software development since 2024, but Kaggle makes it visible in a context where human expertise was long considered the decisive factor — and where the performance gap is publicly measurable.
Security Alert: LiteLLM Supply Chain Attack
Two PyPI releases of LiteLLM, versions 1.82.7 and 1.82.8, were compromised in a supply chain attack traced to a Trivy dependency in the project's CI/CD pipeline; both packages have been pulled from PyPI.
LiteLLM is widely used as a unified abstraction layer across LLM APIs in production AI systems, making this a significant exposure event for engineering teams that auto-update dependencies. The attack arrived via CI/CD infrastructure rather than the main codebase itself — illustrating the expanding attack surface of AI tooling. If you're running LiteLLM in production, pin to a confirmed safe version immediately. Several engineers noted the broader concern: large OSS repos with many upstream dependencies are high-value targets precisely because they sit upstream of critical production AI systems.
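A quick check for the two affected releases might look like the following. This is a sketch, not an official remediation script: the helper names are hypothetical, the version list contains only the two releases named above, and a clean result says nothing about whether any other version is safe.

```python
from importlib import metadata

# The two releases reported as compromised in this item.
COMPROMISED = {"1.82.7", "1.82.8"}


def is_compromised(version: str) -> bool:
    """True if the version string matches one of the known-bad releases."""
    return version.strip() in COMPROMISED


def check_installed(package: str = "litellm") -> str:
    """Report whether the installed distribution is one of the known-bad releases."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return f"{package} is not installed in this environment"
    if is_compromised(version):
        return f"{package}=={version} is a compromised release; remove it and rotate credentials"
    return f"{package}=={version} is not one of the two known-bad releases"


print(check_installed())
```

Pinning an exact version in a requirements file (`litellm==<version you have verified>`) prevents an automatic update from pulling in a bad release; which version to pin should come from the project's own advisory.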
Science & World Signals
CERN physicists transported 92 atoms of antimatter in a truck for the first time in history, successfully moving the most expensive and volatile substance on Earth over a real-world distance.
Antimatter annihilates instantly on contact with ordinary matter, making containment during transport an extraordinary engineering challenge; the CERN team achieved it using a magnetic trap designed to survive road vibration and maintain field integrity. The milestone was shared widely by Derya Unutmaz (@DeryaTR_), a professor of immunology who has become one of the clearest science communicators on the platform. Practical applications — medical imaging, fundamental physics experiments — remain largely theoretical, but the transport milestone is unambiguous: the infrastructure for moving antimatter outside laboratory walls now exists.
A humanoid robot appeared alongside the First Lady at the White House, the first confirmed humanoid robot presence at a formal White House event, in a post shared by Figure AI's Brett Adcock.
Physical AI reaching the symbolic center of American political life, even in a ceremonial capacity, is a notable cultural milestone alongside the technical benchmarks filling the rest of this feed. The optics — a bipedal robot standing in a room with flags — mark a moment when robots have crossed from industrial floors and research labs into the political spectacle of the nation's most photographed address.