Radar trends by Loukides
DeepSeek-V3 is another LLM to watch. Its performance is on a par with Llama 3.1, GPT-4o, and Claude Sonnet. While training was not inexpensive, the cost was estimated to be roughly 10% of the cost of training the bigger models.
Not to be outdone by Google, OpenAI previewed its next models: o3 and o3-mini. These are both “reasoning models” that have been trained to solve logical problems. They may be released in late January; OpenAI is looking for safety and security researchers for testing.
Not to be outdone by 12 Days of OpenAI, Google has released a new experimental model that has been trained to solve logical problems: Gemini 2.0 Flash Thinking. Unlike OpenAI's reasoning models, Flash Thinking shows its chain of thought explicitly.
Jeremy Howard and his team have released ModernBERT, a major upgrade to the BERT model they released six years ago. It comes in two sizes: 139M and 395M parameters. It's ideal for retrieval, classification, entity extraction, and other components of a data pipeline.
AWS’s Bedrock service has the ability to check the output of other models for hallucinations.
To make sure it isn't outdone by 12 Days of OpenAI, Google has announced Android XR, an operating system for extended reality headsets and glasses. Google doesn't plan to build its own hardware; it's partnering with Samsung, Qualcomm, and other manufacturers.
Also not to be outdone by 12 Days of OpenAI, Anthropic has announced Clio, a privacy-preserving approach to finding out how people use their models. That information will be used to improve Anthropic’s understanding of safety issues and to build more helpful models.
Not to be outdone by 12 Days of OpenAI, Google has announced Gemini 2.0 Flash, a multimodal model that supports streaming for both input and output. The announcement also showcased Astra, an AI agent for smartphones. Neither is generally available yet.
OpenAI has released canvas, a new feature that combines programming with writing. Changes to the canvas (code or text) immediately become part of the context. Python code is executed in the browser using Pyodide (Wasm), rather than in a container (as with Code Interpreter).
Stripe has announced an agent toolkit that lets you build payments into agentic workflows. Stripe recommends using the toolkit in test mode until the application has been thoroughly validated.
Simon Willison shows how to run a GPT-4 class model (Llama 3.3 70B) on a reasonably well-equipped laptop (64GB MacBook Pro M2).
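A back-of-the-envelope calculation shows why this works: a 70B-parameter model quantized to 4 bits needs roughly 35GB for its weights, which leaves headroom on a 64GB machine. The sketch below is an illustrative estimate (the `overhead` factor is a guess for KV cache and runtime buffers, not a figure from Willison's post):

```python
def model_memory_gb(params_billions: float, bits_per_weight: int,
                    overhead: float = 1.2) -> float:
    """Rough memory footprint of a quantized LLM's weights.

    overhead accounts for KV cache, activations, and runtime buffers;
    1.2 is an illustrative guess, not a measured figure.
    """
    bytes_per_weight = bits_per_weight / 8
    weights_gb = params_billions * 1e9 * bytes_per_weight / 1e9
    return weights_gb * overhead

# Llama 3.3 70B at 4-bit quantization: ~42GB with overhead, which fits
# on a 64GB MacBook Pro; at full 16-bit precision (~168GB) it would not.
print(round(model_memory_gb(70, 4)))   # ≈ 42
print(round(model_memory_gb(70, 16)))  # ≈ 168
```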
As part of their 12 Days of OpenAI series, OpenAI finally released their video generation model, Sora. It’s free to ChatGPT Plus subscribers, though limited to 50 five-second video clips per month; a ChatGPT Pro account relaxes many of the limitations.
Researchers have shown that advanced AI models, including Claude 3 Opus and OpenAI o1, are capable of “scheming”: working against the interests of their users to achieve their goals. Scheming includes subverting oversight mechanisms, intentionally delivering subpar results, and even taking steps to prevent shutdown or replacement. Hello, HAL?
Google has announced PaliGemma 2, a new version of its Gemma models that incorporates vision.
o1-preview is no more; the preview is now the real thing, OpenAI o1. In addition to advanced reasoning skills, the production release claims to be faster and to deliver more consistent results.
A group of AI agents in Minecraft behaved surprisingly like humans, even developing jobs and religions. Is this a way to model how human groups collaborate?
One thing the AI industry needs desperately (aside from more power) is better benchmarks. Current benchmarks are closed, easily gamed (that’s what AI does), and unreproducible, and they may not test anything meaningful. BetterBench is a framework for assessing benchmark quality.
Palmyra Creative, a new language model from Writer, promises the ability to develop style so that all AI-generated output won’t sound boringly the same.
During training AI picks up biases from human data. When humans interact with the AI, there’s a feedback loop that amplifies those biases.
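The dynamics are easy to sketch: if the model amplifies the bias in its training data and humans then drift toward the model's output, small biases compound over repeated interaction. The toy simulation below uses made-up parameters purely to illustrate the feedback loop; it is not the study's methodology:

```python
def bias_feedback(h0: float = 0.1, gain: float = 1.3,
                  drift: float = 0.5, rounds: int = 10) -> list[float]:
    """Toy feedback loop between human bias and a model trained on it.

    h0:    initial human bias (illustrative value)
    gain:  how much the model exaggerates bias in its training data
    drift: how far humans move toward the model's bias each round
    """
    h = h0
    history = []
    for _ in range(rounds):
        b = min(1.0, gain * h)     # model amplifies the bias it learned
        h = h + drift * (b - h)    # humans drift toward the model's bias
        history.append(h)
    return history

history = bias_feedback()
print(round(history[-1], 3))  # bias roughly quadruples in ten rounds
```

With any gain above 1 and positive drift, bias grows every round; the loop only stabilizes once the bias saturates.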
Roaming RAG is a new technique for retrieval-augmented generation that finds relevant content by searching through headings to navigate documents, like a human might. It requires well-structured documents. A surprisingly simple idea, really.
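A minimal sketch of the idea: split a document by its headings, score each heading against the query, and expand only the best-matching section. Here a keyword-overlap score stands in for the LLM that would normally choose which heading to follow, and all names and the sample document are illustrative:

```python
import re

def parse_sections(markdown: str) -> dict[str, str]:
    """Split a markdown document into {heading: body} pairs."""
    sections, heading, body = {}, None, []
    for line in markdown.splitlines():
        m = re.match(r"#+\s+(.*)", line)
        if m:
            if heading is not None:
                sections[heading] = "\n".join(body).strip()
            heading, body = m.group(1), []
        elif heading is not None:
            body.append(line)
    if heading is not None:
        sections[heading] = "\n".join(body).strip()
    return sections

def roam(markdown: str, query: str) -> str:
    """Navigate by headings: pick the heading that best matches the
    query and expand only that section, like a reader scanning a ToC."""
    sections = parse_sections(markdown)
    words = set(query.lower().split())
    best = max(sections, key=lambda h: len(words & set(h.lower().split())))
    return sections[best]

doc = """# Install
Run pip install roamer.
# Configuration
Set ROAMER_HOME before first use.
"""
print(roam(doc, "configuration settings"))  # prints the Configuration body
```

The point the technique relies on is visible even in this toy: only the chosen section's text ever reaches the model's context, so retrieval quality depends on the document having meaningful headings.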