April 2026 has been quietly significant on HuggingFace. The model releases that matter right now aren't the headline-grabbing 400B parameter monsters — they're the ones you can actually run, and the techniques that are making those runs better. Here's what's worth knowing.

Gemma 4 has been the dominant story of the past two weeks, and for good reason. The unsloth quantizations landed on HuggingFace almost simultaneously with Google's official release, which means GGUF versions were available before most people had finished reading the blog post. The model comes in several sizes — E2B, E4B, 26B, and 31B — and all four had GGUF quantizations up before stable llama.cpp support had even settled. For 3060 users specifically, the numbers are straightforward: E2B at 30 tokens per second in Q4_K_M, E4B at 14 tokens per second, and the 26B at 8 tokens per second with GPU_LAYERS=90, i.e. with 90 layers offloaded to the GPU. The 31B sits at the edge of what's comfortable on 12GB of VRAM — it works, but you're pushing the limit, and your token generation speed reflects that.
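As a sanity check on those fit claims, here's a back-of-the-envelope VRAM estimate. This is a rule of thumb of my own, not anything from the release notes: it ignores KV cache and runtime overhead, and the ~4.8 bits-per-weight figure for Q4_K_M is an approximation.

```python
def gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight footprint of a quantized model, in gigabytes."""
    return params_billions * bits_per_weight / 8

# Q4_K_M averages roughly 4.8 bits per weight.
for n in (26, 31):
    size = gguf_size_gb(n, 4.8)
    verdict = "fits" if size <= 12 else "needs partial offload"
    print(f"{n}B at Q4_K_M ~ {size:.1f} GB -> {verdict} on a 12GB card")
```

Both of the larger models come out well above 12GB, which is exactly why the 26B runs with partial GPU offload and why the 31B feels like it's at the tolerance limit.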

What's more interesting than the benchmarks is the uncensored variant situation. The Jiunsong/supergemma4-26b-uncensored-gguf-v2 release from a few days ago represents the kind of model that would have required either significant manual abliteration or a dedicated fine-tune six months ago. Now it's a drop-and-run situation. Whether that's a good thing depends entirely on your use case, but for research purposes the uncensored variants are substantially more useful for understanding what models actually know versus what they've been trained to say.

On the research side, the most interesting paper from this week's arXiv is one that hasn't gotten nearly enough attention: Reasoning Graphs (arXiv:2604.07595). The problem it addresses is intuitive once you think about it. Language model agents — the kind that use chain-of-thought reasoning to work through complex queries — discard all of that reasoning after each run. The same question gets answered differently each time, and the same failure modes repeat without any accumulation of insight. Reasoning Graphs fixes this by structuring the chain of thought as a graph rather than a linear sequence: each piece of evidence becomes a node, and the reasoning chains become edges that persist across runs. When the same evidence appears in a new query, the system traverses all prior evaluation edges for that evidence item, surfacing how it was judged before.
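A minimal sketch of that evidence-centric structure — my own illustration of the idea, not the paper's implementation, and every class and method name here is hypothetical:

```python
from collections import defaultdict

class EvidenceGraph:
    """Evidence items are nodes; each reasoning run adds judgment
    edges that persist and can be traversed by later queries."""

    def __init__(self):
        # evidence_id -> list of (query, verdict) edges from prior runs
        self._edges = defaultdict(list)

    def record_judgment(self, evidence_id: str, query: str, verdict: str) -> None:
        self._edges[evidence_id].append((query, verdict))

    def prior_judgments(self, evidence_id: str) -> list:
        # Traverse every prior evaluation edge for this evidence item.
        return list(self._edges[evidence_id])

g = EvidenceGraph()
g.record_judgment("doc-17", "When did X ship?", "supports: shipped 2024")
g.record_judgment("doc-17", "Is X stable?", "irrelevant")
print(g.prior_judgments("doc-17"))  # both prior edges surface for a new query
```

The point of the structure is the reuse: the second time "doc-17" is retrieved, the agent starts from two prior verdicts instead of from zero.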

The benchmark numbers are real: a 47% error reduction compared to vanilla RAG on the same questions, once evidence profile coverage passes 50%. On 4-hop questions — the kind that require chaining multiple facts together — accuracy improves by 11 percentage points. In high-reuse settings, where you're asking similar types of questions repeatedly, it achieves what's described as Pareto dominance: highest accuracy, 47% lower cost, and 46% lower latency, simultaneously. All of this without any retraining. The base model stays frozen; every gain comes from context engineering via graph traversal.

What that means practically: this is the architecture pattern you'd use if you were building a research agent that works over months rather than minutes. Not a one-shot Q&A system, but something that learns from every query it ever processes and brings that accumulated structure to the next one. The evidence-centric feedback loop is the key insight — instead of retrieving similar past queries (which degrades as vocabulary diverges), you retrieve how specific evidence was judged in prior contexts, which is far more stable and useful.
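Concretely, the retrieval step looks less like "find similar past queries" and more like the following — a self-contained sketch under my own assumptions about the data shape, not the paper's API:

```python
def build_evidence_context(judgments: dict, retrieved_ids: list) -> str:
    """judgments maps evidence_id -> [(prior_query, verdict), ...].
    For each evidence item retrieved for the new query, surface how it
    was judged before, instead of searching for similar past queries."""
    lines = []
    for eid in retrieved_ids:
        for prior_query, verdict in judgments.get(eid, []):
            lines.append(f"[{eid}] previously judged '{verdict}' for: {prior_query}")
    return "\n".join(lines)

history = {"doc-17": [("When did X ship?", "supports")]}
print(build_evidence_context(history, ["doc-17", "doc-99"]))
```

Note that "doc-99" contributes nothing here because it has no prior judgments — the context only grows where evidence has actually been evaluated before, which is why this stays stable as query vocabulary drifts.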

The other paper worth noting from this cycle is PaperOrchestra from Google AI Research (arXiv:2604.05018). It's a multi-agent framework for automated research paper writing that transforms unstructured research materials into LaTeX manuscripts with literature synthesis and generated figures. They built a benchmark called PaperWritingBench from 200 reverse-engineered top-tier AI conference papers, and PaperOrchestra outperforms autonomous baselines by 50-68% on literature review quality alone. It's not that these systems are writing great papers — they're not, not yet — but they're getting meaningfully better at the literature synthesis component, which is the tedious part that takes humans the longest.
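I can't reproduce PaperOrchestra's internals from the abstract; purely as an illustration of the multi-agent shape — staged agents passing a shared state toward a LaTeX manuscript — here's a sketch where every stage name is hypothetical:

```python
def literature_agent(state: dict) -> dict:
    # Stand-in for the literature-synthesis stage.
    state["related_work"] = f"Synthesis over {len(state['materials'])} source notes."
    return state

def figure_agent(state: dict) -> dict:
    # Stand-in for figure generation.
    state["figures"] = ["fig1: results overview"]
    return state

def latex_agent(state: dict) -> dict:
    # Stand-in for manuscript assembly.
    state["manuscript"] = "\\section{Related Work}\n" + state["related_work"]
    return state

def run(stages: list, materials: list) -> dict:
    state = {"materials": materials}
    for stage in stages:
        state = stage(state)
    return state

out = run([literature_agent, figure_agent, latex_agent], ["note A", "note B"])
print(out["manuscript"].splitlines()[0])  # \section{Related Work}
```

The interesting part of the real system is presumably in how the literature stage is evaluated — that's where the 50-68% gap over autonomous baselines shows up.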

The through-line connecting all of this: AI capability isn't just scaling anymore. It's architecture, it's memory, it's the structure you wrap around the model. The Gemma 4 numbers on your 3060 are real. The Reasoning Graphs paper is real. The gap between what you can run locally and what's being published in research is narrowing faster than most people realise.